Creating and optimizing a custom effect for the Nokia Imaging SDK

From Nokia Developer Wiki
Aside from the fact that the release code is a lot faster (especially the native code), we can see immediately that those of our optimization steps that do not move code from C# to C++ or from pure C++ to C++/NEON have much less of an effect. This is due to the compiler optimizations. What you can also see is that the compiler is not able to protect us from issues like the additional memory copy we did during our first attempt at utilizing the NEON instructions. The last thing you might notice is that, percentage-wise, the gains seem to be much smaller in the release configuration than they were in the debug configuration. This can be explained if you keep in mind that we're not measuring our effect's execution itself but the time for the whole rendering pipeline surrounding it (mainly copying data into our source buffer and out of our target buffer). The time it takes for the SDK to perform those tasks isn't affected by our change in configuration, as we're always linking against release code of the SDK. This time is now making up a bigger share of the time the rendering pipeline takes to execute.
 
== When to optimize ==
Having read all this, you might now ask yourself: when do I start optimizing my code? When do I stop? In my opinion it is best to start by implementing your custom effect in a straightforward, easy-to-understand way. Then you should test whether it is fast enough. If it is, I don't see much reason to put a lot of effort into optimization. However, as you have seen, it is quite easy to move your code from C# to C++, and you get quite a big boost in performance by doing so, so this is something you should always consider. Moving to SIMD instructions or applying other (perhaps lossy) optimizations, however, is something I suggest you don't do unless your use case requires it (e.g. realtime camera feed preview), because it is easy to spend considerable amounts of time on it.
  
 
== Summary ==

Revision as of 19:57, 22 December 2013

This article explains how to create a custom effect for the Nokia Imaging SDK and how to optimize its performance by moving it to native code, providing guidance on how to transform an existing algorithm to take advantage of SIMD instructions like ARM NEON. We'll also be building a small test bench application to compare the optimized versions with the reference implementation.

Note: This is an entry in the Nokia Imaging and Big UI Wiki Competition 2013Q4.

Article Metadata
Code Example
Tested with
SDK: Windows Phone 8.0 SDK
Device(s): Nokia Lumia 1020, Nokia Lumia 1520, Samsung Ativ S
Compatibility
Platform(s): Windows Phone 8
Dependencies: Nokia Imaging SDK 1.0
Article
Created: SB Dev (18 Nov 2013)
Last edited: SB Dev (22 Dec 2013)


Introduction

We will be using a variant of the DoubleEffect that Nokia provides as part of the Imaging SDK documentation as the starting point of our discussion. It is called the MultiplyEffect and its purpose is to multiply each color component by a multiplication factor provided to the effect as a parameter. The alpha channel will not be altered by the effect. The reason for using such a simple effect is to keep the resulting code simple enough to make it easy to understand what we are doing in each step.

Our first implementation will be done in C# to take advantage of the Nokia.Graphics.Imaging.CustomEffectBase base class provided by the SDK. It will serve as our reference implementation, which we will use to check if the results provided by our optimized filters are the same. In the next step we will reimplement the MultiplyEffect as a WinPRT component written in C++/Cx. We will have to do without the help of the CustomEffectBase and instead will have to look into how to implement the interface Nokia::Graphics::Imaging::ICustomEffect.

Having two implementations of the MultiplyEffect, we will talk about how to compare them. While we will not look into the implementation of the comparer (it's quite easy to read the code provided as part of the sample project), we will talk about what kind of information it outputs and why that information is helpful when trying to optimize an imaging effect. The last part will deal with techniques to transform your code in such a way that it gets easier to optimize using vector processing, and will look into examples of doing so using ARM NEON intrinsics. The progression of the performance through all those steps is shown in the following diagram. You should keep in mind, though, that those are benchmarking numbers applicable to the MultiplyEffect on the tested devices. Speedup may be bigger or smaller depending on the effect you implement, and for some effects will depend not only on the size of the processed image (as is the case here) but also on its contents.

CustomEffectSample perf.png

The rest of the article assumes that you have at least a basic understanding of both C# and C++. I will try to provide pointers to useful resources that outline special behaviors or functionality required to understand the code. As the article aims to outline possible steps to get better performance rather than just showcase optimized code, the resulting code still has potential for further optimization.

If you have not yet worked with the Imaging SDK, you can find instructions on how to install it in your project in Nokia's documentation: Download and add the libraries to the project. If you have multiple projects in your solution, you will be asked to select the ones that will make use of the SDK. Sometimes compiling against the SDK will not work immediately after download, in which case a restart of Visual Studio fixes the problem.

Note: The sample solution contains ARM NEON intrinsics. The resulting code can only run on ARM CPUs, which prevents it from running in the emulator. Therefore the affected code has been placed in a separate project called CustomEffectNativeNeonOptimized, which can easily be removed from the solution to test the rest of the sample without having to use an actual device.

Reference Implementation in C#

A custom effect in C# can easily be created by making your effect class inherit from Nokia.Graphics.Imaging.CustomEffectBase. Only the OnProcess method needs to be implemented to create a new effect. Its parameters give you direct access to the pixel data that you are supposed to alter (depending on the settings of the imaging pipeline, your code is not always supposed to alter the whole image). We've defined a helper method called MultiplyBounded which does the actual calculation for each color component of a pixel while making sure that the resulting value is no larger than 255 (we only have 8 bits to represent each color component). We also want the user of our MultiplyEffect to set a multiplication factor, which has to be provided to the effect's constructor.

using System;
using Windows.UI;
using Nokia.Graphics.Imaging;

namespace CustomEffectManaged
{
    public class MultiplyEffect : CustomEffectBase
    {
        private byte multiplier;

        public MultiplyEffect(IImageProvider source, byte multiplier) : base(source)
        {
            this.multiplier = multiplier;
        }

        protected override void OnProcess(PixelRegion sourcePixelRegion, PixelRegion targetPixelRegion)
        {
            sourcePixelRegion.ForEachRow((index, width, pos) =>
            {
                for (int x = 0; x < width; ++x, ++index)
                {
                    Color c = ToColor(sourcePixelRegion.ImagePixels[index]);
                    c.R = MultiplyBounded(c.R);
                    c.G = MultiplyBounded(c.G);
                    c.B = MultiplyBounded(c.B);
                    targetPixelRegion.ImagePixels[index] = FromColor(c);
                }
            });
        }

        private byte MultiplyBounded(byte value)
        {
            return (byte)Math.Min(255, value * multiplier);
        }
    }
}

The code of the C# implementation of our MultiplyEffect can be found inside the CustomEffectManaged project included with the sample solution.

Basic Implementation in C++

Creating a custom effect in C++ is a little more complicated than in C#. You can't simply inherit from a base class and only add the code required to transform the input to the output. Instead you have to make your WinRT component implement the interface Nokia::Graphics::Imaging::ICustomEffect. To spare you from implementing everything yourself, the resulting component is not an effect in its own right; instead you have to wrap it inside a Nokia.Graphics.Imaging.DelegatingEffect to execute it.

The following code is the basic template you need to start working on a custom filter using native code. Aside from the class/namespace names of the implementation you can copy/paste it to start work on your own custom effect written in native code.

Header file:

// MultiplyEffect.h
#pragma once

namespace CustomEffectNative
{
    public ref class MultiplyEffect sealed : public Nokia::Graphics::Imaging::ICustomEffect
    {
    public:
        MultiplyEffect();
        virtual Windows::Foundation::IAsyncAction^ LoadAsync();
        virtual void Process(Windows::Foundation::Rect rect);
        virtual Windows::Storage::Streams::IBuffer^ ProvideSourceBuffer(Windows::Foundation::Size imageSize);
        virtual Windows::Storage::Streams::IBuffer^ ProvideTargetBuffer(Windows::Foundation::Size imageSize);
    private:
        unsigned int imageWidth;
        Windows::Storage::Streams::Buffer^ sourceBuffer;
        Windows::Storage::Streams::Buffer^ targetBuffer;
    };
}

Implementation:

// MultiplyEffect.cpp
#include "pch.h"
#include <wrl.h>
#include <robuffer.h>
#include "MultiplyEffect.h"
#include <ppltasks.h>
#include "Helpers.h"

using namespace CustomEffectNative;
using namespace Platform;
using namespace concurrency;
using namespace Nokia::Graphics::Imaging;
using namespace Microsoft::WRL;

MultiplyEffect::MultiplyEffect()
{
}

Windows::Foundation::IAsyncAction^ MultiplyEffect::LoadAsync()
{
    return create_async([this]
    {
        //add your initialisation logic here
    });
}

void MultiplyEffect::Process(Windows::Foundation::Rect rect)
{
    //code implementing the actual effect will go here
}

Windows::Storage::Streams::IBuffer^ MultiplyEffect::ProvideSourceBuffer(Windows::Foundation::Size imageSize)
{
    unsigned int size = (unsigned int)(4 * imageSize.Height * imageSize.Width);
    sourceBuffer = ref new Windows::Storage::Streams::Buffer(size);
    sourceBuffer->Length = size;
    imageWidth = (unsigned int)imageSize.Width;
    return sourceBuffer;
}

Windows::Storage::Streams::IBuffer^ MultiplyEffect::ProvideTargetBuffer(Windows::Foundation::Size imageSize)
{
    unsigned int size = (unsigned int)(4 * imageSize.Height * imageSize.Width);
    targetBuffer = ref new Windows::Storage::Streams::Buffer(size);
    targetBuffer->Length = size;
    return targetBuffer;
}

Note: The above code frequently shows the hat symbol '^' at the end of a type name. This signifies that the value is a reference-counted WinRT component. Those components are declared as "public ref class <classname> sealed" and instantiated using "ref new" instead of "new". A detailed description of how this works is outside the scope of this article, but it should be sufficient to know that when you declare a class as a WinRT component, the Runtime will make it available for use from all WinRT languages, which includes the managed languages C# and Visual Basic. A good introduction to these extensions to C++, called C++/Cx, can be found in the article A Tour of C++/CX, although the samples there deal with Windows 8 instead of Windows Phone 8.

As you can see, we have to do quite a bit more work than in C#. First we have to provide buffers to the SDK that will be used to store the input and output data. You might have noticed that as part of the class declaration we have added a field called imageWidth. While the calculation of our MultiplyEffect does not need knowledge of any pixel except the current one, many other effects do. To be able to address e.g. the pixel above the current one, the width needs to be retained. We will also need it to work only with the pixels inside the buffers that we were asked to alter (as was already mentioned when talking about the sourcePixelRegion in C#).

Another method that we have to implement as part of the interface but won't be needing in this example is LoadAsync. You should use that method to e.g. preload/precompute assets that don't depend on the input image. One example would be if you were to blend the image you receive from the rendering pipeline with another image. The other image would be known to your effect as a parameter and you could load that image into memory when LoadAsync is called, so it's immediately available when you need it for the actual rendering.

Inspecting the buffers that we are creating, you might notice that their interface does not provide any access to the encapsulated raw data. To gain access to it you will have to perform some recasting of COM objects. Fortunately Microsoft provides the necessary code in the article Obtaining pointers to data buffers (C++/CX). I have placed the described function in a separate helper namespace as it's used in several different versions of the MultiplyEffect.

// Helpers.cpp
#include "pch.h"
#include <wrl.h>
#include <robuffer.h>
#include "Helpers.h"
#include <ppltasks.h>

using namespace Microsoft::WRL;

//method definition from http://msdn.microsoft.com/en-us/library/Windows/Apps/dn182761.aspx
byte* CustomEffectNative::Helpers::GetPointerToPixelData(Windows::Storage::Streams::IBuffer^ pixelBuffer, unsigned int *length)
{
    if (length != nullptr)
    {
        *length = pixelBuffer->Length;
    }

    // Query the IBufferByteAccess interface.
    ComPtr<Windows::Storage::Streams::IBufferByteAccess> bufferByteAccess;
    reinterpret_cast<IInspectable*>(pixelBuffer)->QueryInterface(IID_PPV_ARGS(&bufferByteAccess));

    // Retrieve the buffer data.
    byte* pixels = nullptr;
    bufferByteAccess->Buffer(&pixels);
    return pixels;
}

Having all that support code in place we can now again concentrate on how to implement the actual effect. The implementation of the methods mentioned in the C++ section so far will stay the same for all examples shown in the rest of this article.

void MultiplyEffect::Process(Windows::Foundation::Rect rect)
{
    unsigned int sourceLength, targetLength;
    byte* sourcePixels = CustomEffectNative::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
    byte* targetPixels = CustomEffectNative::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);

    unsigned int minX = (unsigned int)rect.X * 4;
    unsigned int minY = (unsigned int)rect.Y;
    unsigned int maxX = minX + (unsigned int)rect.Width * 4;
    unsigned int maxY = minY + (unsigned int)rect.Height;

    for (unsigned int y = minY; y < maxY; y++)
    {
        unsigned int xOffset = y * imageWidth * 4;
        for (unsigned int x = minX; x < maxX; x += 4)
        {
            //Imaging SDK uses Blue, Green, Red, Alpha image format with 8 bits/channel
            byte b = sourcePixels[xOffset + x];
            byte g = sourcePixels[xOffset + x + 1];
            byte r = sourcePixels[xOffset + x + 2];
            byte a = sourcePixels[xOffset + x + 3];

            b = MultiplyBounded(b);
            g = MultiplyBounded(g);
            r = MultiplyBounded(r);

            targetPixels[xOffset + x] = b;
            targetPixels[xOffset + x + 1] = g;
            targetPixels[xOffset + x + 2] = r;
            targetPixels[xOffset + x + 3] = a;
        }
    }
}

byte MultiplyEffect::MultiplyBounded(byte value)
{
    return (byte)min(255, ((int)value) * multiplier);
}

The implementation of the effect is still pretty straightforward. First we use the GetPointerToPixelData helper method to extract the actual data arrays from the buffers. Next we calculate the bounds of the area we are supposed to alter, so we can use them in the loops we use to move through the image data. One pixel consists of 4 bytes of data (Blue, Green, Red and Alpha), so we have to multiply all width-related values by 4, as the image width is given in pixels. Moving through the image along the x-axis needs to be done in increments of 4 for the same reason. We extract the data for all color channels into separate variables, apply the calculation to each of them using MultiplyBounded and finally store them in the output buffer - pretty much the same as in the C# version, but since we can't use the ToColor/FromColor helper methods here, I opted to do without Color structs as well.
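The offset arithmetic described above can be sketched in isolation. This is a minimal illustrative model (the helper name bgraOffset is mine, not part of the Imaging SDK): for a tightly packed BGRA buffer of imageWidth pixels per row, the byte index of channel c of pixel (x, y) is (y * imageWidth + x) * 4 + c.

```cpp
#include <cassert>
#include <cstddef>

// Byte index of channel `channel` (0 = Blue, 1 = Green, 2 = Red, 3 = Alpha)
// of pixel (x, y) in a tightly packed BGRA buffer with `imageWidth` pixels per row.
// Illustrative helper only, not part of the Imaging SDK.
std::size_t bgraOffset(std::size_t x, std::size_t y,
                       std::size_t imageWidth, std::size_t channel)
{
    return (y * imageWidth + x) * 4 + channel;
}
```

For a 100-pixel-wide image, the Red byte of pixel (0, 1) sits at index 1 * 100 * 4 + 2 = 402, which matches the xOffset computation in the loop above.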

Note: Regular C++ programmers might have noticed that the sample code makes use of the datatype byte, which does not exist in standard C++. It is a typedef of the Windows Runtime headers and can safely be exchanged for unsigned char in your own code.

The full code of the C++ implementation of our MultiplyEffect can be found inside the CustomEffectNative project included with the sample solution.

Comparing Effects

Screenshot

The main view of our sample application is designed to help you with testing and comparing different versions of your custom effect. The first page of the panorama holds a button allowing you to select an image from the media library. Doing so will apply two effects to the selected image. The first is our reference, meaning the filter we built to do what we set out to do; the second is the comparison effect that is supposed to do the same thing, but faster. The result of the reference effect will be shown above the button, the result of the comparison below. Those mainly serve to spot potential issues at a glance.

Performance

The second page is likely to be of most interest when dealing with effect optimization. It shows the time it took to execute the reference effect and the comparison effect in milliseconds. It also shows some information on the size of the image. It's important to note that those values will vary not only from device to device but also, to a certain degree, between consecutive runs on the same device. To really compare your changes you will have to run the test a few times. The following table shows the minimum and maximum time it took to run the two effects implemented so far on a Lumia 1020, as well as the minimum time it took them to execute on a Lumia 1520. The image being used has a size of 1188 by 1188 pixels and is included with the sample project.

Effect        | Lumia 1020 max | Lumia 1020 min | Lumia 1520 min
Managed (C#)  | 745 ms         | 708 ms         | 346 ms
Native (C++)  | 370 ms         | 358 ms         | 238 ms

We're doing the performance measurement inside the BitmapComparer class of the sample project, which in turn builds a minimal imaging pipeline for the image and effects provided to its Compare method. The time measured is the time it takes to complete the RenderAsync call while rendering the provided Bitmap using the provided effects to another Bitmap of matching dimensions. This is the closest we can get to the time it takes for the actual effect to run without putting the benchmarking logic directly inside the effect.
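The measuring approach can be sketched in portable C++ (the sample itself times the C# RenderAsync call; this is only an illustrative equivalent, and applyEffect is a made-up stand-in for rendering with an effect): the whole operation is timed end to end rather than instrumenting its inner loop.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Illustrative stand-in for "render with an effect": a bounded doubling of each byte.
void applyEffect(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst)
{
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = static_cast<std::uint8_t>(src[i] < 128 ? src[i] * 2 : 255);
}

// Measure the wall-clock time of one complete run, in milliseconds.
double measureMilliseconds(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst)
{
    auto start = std::chrono::steady_clock::now();
    applyEffect(src, dst);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Because single runs fluctuate, you would call measureMilliseconds several times and look at the minimum and maximum, exactly as the tables in this section do.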

Using the BitmapComparer for your own effects is quite easy. You only have to instantiate your effects (wrapping native effects inside a DelegatingEffect) and call the Compare method. As our custom effects are WinRT components, you can simply instantiate them as you would any managed class (the Windows Phone Runtime will take care of all marshaling, etc. behind the scenes). The best place to instantiate your effects is in the DoComparison method of the ComparisonPage included in the sample project. All effects developed as part of this article are already kept there to make testing everything out easy.

private async void DoComparison(System.IO.Stream imageStream)
{
    Bitmap bmp = null;
    using (StreamImageSource sis = new StreamImageSource(imageStream))
    {
        bmp = await sis.GetBitmapAsync(null, OutputOption.PreserveAspectRatio);

        CustomEffectBase referenceEffect = new CustomEffectManaged.MultiplyEffect(sis, 2);
        DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNative.MultiplyEffect(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNative.MultiplyEffectInlined(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNativeNeonOptimized.MultiplyEffectInlinedOptimized(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNative.MultiplyEffectInlinedUnrolled(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNativeNeonOptimized.MultiplyEffectInlinedUnrolledOptimized(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNativeNeonOptimized.MultiplyEffectInlinedUnrolledOptimizedDeviating(2));
        //DelegatingEffect comparisonEffect = new DelegatingEffect(new CustomEffectNativeNeonOptimized.MultiplyEffectInlinedUnrolledOptimizedDeviating128(2));
        Comparer.Compare(bmp, referenceEffect, comparisonEffect);
    }
}

Rendering Deviations

The third page is used to check whether your optimized effect is producing the same output as the reference effect, as well as helping you check how different it is using two different metrics. What you will be aiming for is to have all values show 0, which means that there is no deviation from the reference effect.

The two metrics being used are the maximum deviation encountered during the comparison as well as the mean deviation. As you're most often processing the color channels separately, these are given for each of the channels (A = Alpha, R = Red, G = Green, B = Blue) as well as combined for all color components. Having a high value for the Max Deviation but a low value for the Mean Deviation means that the comparison effect is producing occasional pixels that are far from the reference rendering but is doing a fine job for the majority of pixels. Meanwhile, a high Mean Deviation means that your effect is overall quite far off from the reference. It is not possible to get a higher Mean Deviation than Max Deviation.
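The two metrics can be sketched as follows. This is an illustrative reimplementation (the struct and function names are mine, not the comparer code from the sample project): for each byte, take the absolute difference between reference and comparison output, then track the maximum and the average.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Deviation
{
    int max;     // largest single deviation encountered
    double mean; // average deviation over all values
};

// Compare two equally sized channel buffers byte by byte.
Deviation computeDeviation(const std::vector<std::uint8_t>& reference,
                           const std::vector<std::uint8_t>& comparison)
{
    int max = 0;
    long long sum = 0;
    for (std::size_t i = 0; i < reference.size(); ++i)
    {
        int diff = std::abs(static_cast<int>(reference[i]) - static_cast<int>(comparison[i]));
        if (diff > max) max = diff;
        sum += diff;
    }
    return { max, reference.empty() ? 0.0 : static_cast<double>(sum) / reference.size() };
}
```

This also makes it obvious why the mean can never exceed the maximum: it is an average of values each bounded by the maximum.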

Sometimes you can sacrifice some accuracy to gain a significant speedup in an effect. One thing worth noting though is that the deviations calculated depend on the input image as well as the parameters you use in your effect. It is entirely possible to get no deviation at all for one input image or a certain set of input parameters while having a high deviation for another. You should therefore use multiple images in your testing efforts or ones specially designed to make your code enter the paths that are likely to lead to a deviation.

If you want to see a practical example of how differences show up in an image and the deviation numbers you could remove the processing logic for one of the color components from one of the sample effects.

b = MultiplyBounded(b);
//g = MultiplyBounded(g);
r = MultiplyBounded(r);

This alteration will exclude the green color component from processing, keeping it the same as in the original image. Therefore only the green channel will show deviations (as will the overall deviation).

Optimization using SIMD/ARM NEON

ARM NEON is a SIMD instruction set available on all Windows Phone 8 devices. A good introduction to what it is and how it works can be found in the article WP8: Optimizing your signal processing algorithms using NEON. We will now look at ways to rewrite your code in such a way that it gets easier to spot opportunities to use SIMD instructions.

Function Inlining

Function inlining is an optimization technique commonly applied by compilers. What it comes down to is simply taking the code of a function/method and copying it in place of the function calls. This can bring performance gains if you have a very short function that is called very often in a loop (e.g. doing a calculation on each pixel in an image), due to avoiding the function call overhead. We'll now inline our MultiplyBounded method to make it easier to see which operations are performed on each color component. The result can be seen below.

void MultiplyEffectInlined::Process(Windows::Foundation::Rect rect)
{
    unsigned int sourceLength, targetLength;
    byte* sourcePixels = CustomEffectNative::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
    byte* targetPixels = CustomEffectNative::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);

    unsigned int minX = (unsigned int)rect.X * 4;
    unsigned int minY = (unsigned int)rect.Y;
    unsigned int maxX = minX + (unsigned int)rect.Width * 4;
    unsigned int maxY = minY + (unsigned int)rect.Height;

    for (unsigned int y = minY; y < maxY; y++)
    {
        unsigned int xOffset = y * imageWidth * 4;
        for (unsigned int x = minX; x < maxX; x += 4)
        {
            //Imaging SDK uses Blue, Green, Red, Alpha image format with 8 bits/channel
            byte b = sourcePixels[xOffset + x];
            byte g = sourcePixels[xOffset + x + 1];
            byte r = sourcePixels[xOffset + x + 2];
            byte a = sourcePixels[xOffset + x + 3];

            //inlined code from MultiplyBounded
            b = (byte)min(255, ((int)b) * multiplier);
            g = (byte)min(255, ((int)g) * multiplier);
            r = (byte)min(255, ((int)r) * multiplier);

            targetPixels[xOffset + x] = b;
            targetPixels[xOffset + x + 1] = g;
            targetPixels[xOffset + x + 2] = r;
            targetPixels[xOffset + x + 3] = a;
        }
    }
}

It is now easy to see that we're doing exactly the same operations on all 3 color components. We first multiply them and then perform a min operation. Fortunately there is also a min operation available as part of the ARM NEON instruction set. We have to make the following changes to take advantage of the ARM NEON instructions. First we declare a vector register holding our multipliers. It's made up of four 16bit values. The first 3 will be the multiplier that we have been using so far. The last one will always be 1, because we will store the alpha value in the fourth position and we don't want to alter that value. We will also declare a similar vector for the min comparison. Those two are both defined outside the calculation loops as their values will not change during the computation.

Inside the calculation loops we first store our color components into a vector holding four 16bit values as well. We then apply the multiplication and min operation to it using ARM NEON intrinsics. Finally we store the resulting data back into the array we initially used to load it. From there we write the result to the target buffer.

void MultiplyEffectInlinedOptimized::Process(Windows::Foundation::Rect rect)
{
    unsigned int sourceLength, targetLength;
    byte* sourcePixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
    byte* targetPixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);

    unsigned int minX = (unsigned int)rect.X * 4;
    unsigned int minY = (unsigned int)rect.Y;
    unsigned int maxX = minX + (unsigned int)rect.Width * 4;
    unsigned int maxY = minY + (unsigned int)rect.Height;

    //define an array to hold our multipliers
    uint16_t multArr[4] = { multiplier, multiplier, multiplier, 1 };
    //load the data into a NEON register
    uint16x4_t regMult = vld1_u16(multArr);

    //define an array to hold our min comparison values
    uint16_t minArr[4] = { 255, 255, 255, 255 };
    //load the data into a NEON register
    uint16x4_t regMin = vld1_u16(minArr);

    for (unsigned int y = minY; y < maxY; y++)
    {
        unsigned int xOffset = y * imageWidth * 4;
        for (unsigned int x = minX; x < maxX; x += 4)
        {
            //Imaging SDK uses Blue, Green, Red, Alpha image format with 8 bits/channel
            byte b = sourcePixels[xOffset + x];
            byte g = sourcePixels[xOffset + x + 1];
            byte r = sourcePixels[xOffset + x + 2];
            byte a = sourcePixels[xOffset + x + 3];

            //define an array with our pixel data
            uint16_t pixel[4] = { b, g, r, a };
            //load the data into a NEON register
            uint16x4_t regPixel = vld1_u16(pixel);
            //perform multiplication into pixel register
            regPixel = vmul_u16(regPixel, regMult);
            //perform min comparison into pixel register
            regPixel = vmin_u16(regPixel, regMin);
            //restore pixel data to pixel array
            vst1_u16(pixel, regPixel);

            targetPixels[xOffset + x] = (byte)pixel[0];
            targetPixels[xOffset + x + 1] = (byte)pixel[1];
            targetPixels[xOffset + x + 2] = (byte)pixel[2];
            targetPixels[xOffset + x + 3] = (byte)pixel[3];
        }
    }
}

We have now done our first optimization step moving sequential code to parallel code. Doing another performance test will give us similar results to the following.

Effect             | Lumia 1020 max | Lumia 1020 min | Lumia 1520 min
Managed (C#)       | 745 ms         | 708 ms         | 346 ms
Native (C++)       | 370 ms         | 358 ms         | 238 ms
Inlined (C++)      | 274 ms         | 241 ms         | 150 ms
Inlined NEON (C++) | 381 ms         | 369 ms         | 216 ms

Inlining the code has given us a nice speedup compared to the regular C++ version we had before. Using the NEON instructions, however, hasn't given us a performance boost; rather, our effect is performing worse. The reason is that our current NEON implementation requires a copy from memory to memory in order to load the 8bit color channels as 16bit values. Given how simple the actual calculation is, the speed gains there don't compensate for the data copying. We also cannot load the data directly from the input buffer and use NEON instructions to do the conversion from 8bit to 16bit, because the only instructions allowing us to load 8bit values require 8 of them (the vector size is 64bit or 128bit), while we only have 4 in each loop iteration. We therefore have to keep looking for additional potential to a) do more parallel computation or b) find more values so we can load 8 of them at once.

Note: We need to use 16bit values for the multiplication step because multiplying an 8bit value by another 8bit value might lead to an overflow if the target value is also 8bit wide. Following the min operation, however, all values fit inside 8bit fields again.
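The overflow the note warns about is easy to demonstrate in plain C++ (illustrative values and function names, assuming a multiplier of 2): computing 200 * 2 in an 8bit variable wraps around to 144 before min can bound it, while widening to 16bit first and then bounding yields the intended 255.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Wrong: the product wraps around modulo 256 before min can bound it.
std::uint8_t multiplyOverflowing(std::uint8_t value, std::uint8_t multiplier)
{
    std::uint8_t product = static_cast<std::uint8_t>(value * multiplier); // 200 * 2 wraps to 144
    return std::min<std::uint8_t>(255, product);
}

// Right: widen to 16 bit, multiply, then bound to 255 (the MultiplyBounded approach).
std::uint8_t multiplyBounded(std::uint8_t value, std::uint8_t multiplier)
{
    std::uint16_t product = static_cast<std::uint16_t>(value) * multiplier;
    return static_cast<std::uint8_t>(std::min<std::uint16_t>(255, product));
}
```

The NEON version has to make the same choice, which is why the pixel data is widened to 16bit lanes before the vmul/vmin pair is applied.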

Loop Unrolling

Loop unrolling means that you merge several iterations of a given loop into one. You can merge an arbitrary number of loop iterations; the number of iterations that you merge is called the unroll factor. You can find a more in-depth description of the unrolling process itself in this article: Wikipedia: Loop unwinding.

//regular loop
for (int i = 0; i < 12; i++)
{
    res[i] = src[i] * multiplier;
}

//unrolled loop - factor 2
for (int i = 0; i < 12; i += 2)
{
    res[i] = src[i] * multiplier;
    res[i + 1] = src[i + 1] * multiplier;
}

//unrolled loop - factor 3
for (int i = 0; i < 12; i += 3)
{
    res[i] = src[i] * multiplier;
    res[i + 1] = src[i + 1] * multiplier;
    res[i + 2] = src[i + 2] * multiplier;
}

If we unroll the inner loop of the inlined version by a factor of 2, we get the following result. Please note that we're now incrementing the counter variable x by 8 instead of by 4 as in the previous samples.

void MultiplyEffectInlinedUnrolled::Process(Windows::Foundation::Rect rect)
{
    unsigned int sourceLength, targetLength;
    byte* sourcePixels = CustomEffectNative::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
    byte* targetPixels = CustomEffectNative::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);

    unsigned int minX = (unsigned int)rect.X * 4;
    unsigned int minY = (unsigned int)rect.Y;
    unsigned int maxX = minX + (unsigned int)rect.Width * 4;
    unsigned int maxY = minY + (unsigned int)rect.Height;

    for (unsigned int y = minY; y < maxY; y++)
    {
        unsigned int xOffset = y * imageWidth * 4;
        for (unsigned int x = minX; x < maxX; x += 8)
        {
            byte b1 = sourcePixels[xOffset + x];
            byte g1 = sourcePixels[xOffset + x + 1];
            byte r1 = sourcePixels[xOffset + x + 2];
            byte a1 = sourcePixels[xOffset + x + 3];
            //duplicated getter code from unrolling operation
            byte b2 = sourcePixels[xOffset + x + 4];
            byte g2 = sourcePixels[xOffset + x + 5];
            byte r2 = sourcePixels[xOffset + x + 6];
            byte a2 = sourcePixels[xOffset + x + 7];

            b1 = (byte)min(255, ((int)b1) * multiplier);
            g1 = (byte)min(255, ((int)g1) * multiplier);
            r1 = (byte)min(255, ((int)r1) * multiplier);
            //duplicated calculation code from unrolling operation
            b2 = (byte)min(255, ((int)b2) * multiplier);
            g2 = (byte)min(255, ((int)g2) * multiplier);
            r2 = (byte)min(255, ((int)r2) * multiplier);

            targetPixels[xOffset + x] = b1;
            targetPixels[xOffset + x + 1] = g1;
            targetPixels[xOffset + x + 2] = r1;
            targetPixels[xOffset + x + 3] = a1;
            //duplicated setter code from unrolling operation
            targetPixels[xOffset + x + 4] = b2;
            targetPixels[xOffset + x + 5] = g2;
            targetPixels[xOffset + x + 6] = r2;
            targetPixels[xOffset + x + 7] = a2;
        }
    }
}

Our unrolled loop is now processing 2 pixels at a time. 2 pixels means 8 8bit color components. This means that we can now load a whole 64bit vector from the source buffer using a NEON load operation, avoiding the memory-to-memory copy we had to resort to in our previous optimization attempt. Our two registers containing the operands for the multiplication now consist of 8 16bit values instead of 4. Inside the calculation loop we use a vmovl operation to convert our 64bit vector containing the 8 8bit input values into a 128bit vector containing 8 16bit values. This vector is first multiplied and then its separate values are bounded to 255 using a min operation. We then use a saturating narrowing operation (vqmovn) to convert it back to a 64bit vector. Lastly we write that vector directly to the output buffer.

void MultiplyEffectInlinedUnrolledOptimized::Process(Windows::Foundation::Rect rect)
{
unsigned int sourceLength, targetLength;
byte* sourcePixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
byte* targetPixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);
 
unsigned int minX = (unsigned int)rect.X * 4;
unsigned int minY = (unsigned int)rect.Y;
unsigned int maxX = minX + (unsigned int)rect.Width * 4;
unsigned int maxY = minY + (unsigned int)rect.Height;
 
//define an array to hold our multipliers
uint16_t multArr[8] = { multiplier, multiplier, multiplier, 1, multiplier, multiplier, multiplier, 1 };
//load the data into a NEON register
uint16x8_t regMult = vld1q_u16(multArr);
 
//define an array to hold our min comparison values
uint16_t minArr[8] = { 255, 255, 255, 255, 255, 255, 255, 255 };
//load the data into a NEON register
uint16x8_t regMin = vld1q_u16(minArr);
 
for(unsigned int y = minY; y < maxY; y++)
{
unsigned int xOffset = y * imageWidth * 4;
for(unsigned int x = minX; x < maxX; x += 8)
{
//load pixel data of two adjacent pixels into NEON register
uint8_t* pixel_8 = (uint8_t*)&sourcePixels[xOffset + x];
uint8x8_t regPixel_8 = vld1_u8(pixel_8);
//convert 8 8bit registers to 8 16bit registers
uint16x8_t regPixel_16 = vmovl_u8(regPixel_8);
//perform multiplication into pixel register
regPixel_16 = vmulq_u16(regPixel_16, regMult);
//perform min comparison into pixel register
regPixel_16 = vminq_u16(regPixel_16, regMin);
//convert 8 16bit registers to 8 8bit registers
regPixel_8 = vqmovn_u16(regPixel_16);
//store pixel data of two adjacent pixels into targetPixels
pixel_8 = (uint8_t*)&targetPixels[xOffset + x];
vst1_u8(pixel_8, regPixel_8);
}
}
}

Following another optimization step, during which we were able to solve the issue we encountered in the previous one, it is again time to look at our performance table.

Effect Lumia 1020 max Lumia 1020 min Lumia 1520 min
Managed (C#) 745 ms 708 ms 346 ms
Native (C++) 370 ms 358 ms 238 ms
Inlined (C++) 274 ms 241 ms 150 ms
Inlined NEON (C++) 381 ms 369 ms 216 ms
Inlined Unrolled (C++) 240 ms 226 ms 146 ms
Inlined Unrolled NEON (C++) 203 ms 193 ms 99 ms

Unrolling by a factor of 2 is again giving our sequential code a slight performance boost. This time, however, the NEON version is able not only to beat the previous NEON version but also to finally take advantage of the parallel computation and beat the sequential version.

Note: By unrolling the loop by a factor of 2 we assume that the width of an input image is an even number. For odd widths we would always read/write one pixel too many for each row. To keep the sample code simple this case is not handled, and in my experience you will rarely see images with an odd width. If you are not certain that your code won't ever have to deal with this case, you should at least check for it and inform the user of the issue, or better yet add a single scalar loop iteration that is executed for the last pixel in each row.

Optimizations leading to deviations in the result

So far all our optimizations were able to produce exactly the same result as our reference effect. Sometimes, however, you can speed up your calculations quite a bit by using tricks that produce a slightly different result. While that would not be acceptable in many use cases (financial data, etc.), it will often be barely noticeable in image processing.

One issue with our implementation so far is that we have to convert our 8bit values to 16bit values in order to deal with the possible overflow scenario. We could however manipulate the input values of the multiplication in such a way that an overflow is no longer possible. This can be done by setting any value that would lead to an overflow to the biggest value that is not producing the overflow.

If we used the multiplier 2, that would mean setting every value in the input buffer that is bigger than 127 to 127, or in general setting any value in the input buffer that is bigger than 255 / multiplier to 255 / multiplier. Effectively we're now doing the min operation before the multiplication. With the original order of operations, any input value bigger than 127 multiplied by 2 would have resulted in 255; using our optimization the result is now 254. With this change we no longer need to move our data between 64bit and 128bit vectors, simplifying our code.

void MultiplyEffectInlinedUnrolledOptimizedDeviating::Process(Windows::Foundation::Rect rect)
{
unsigned int sourceLength, targetLength;
byte* sourcePixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
byte* targetPixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);
 
unsigned int minX = (unsigned int)rect.X * 4;
unsigned int minY = (unsigned int)rect.Y;
unsigned int maxX = minX + (unsigned int)rect.Width * 4;
unsigned int maxY = minY + (unsigned int)rect.Height;
 
//define an array to hold our multipliers
uint8_t multArr[8] = { multiplier, multiplier, multiplier, 1, multiplier, multiplier, multiplier, 1 };
//load the data into a NEON register
uint8x8_t regMult = vld1_u8(multArr);
 
byte bound = 255 / multiplier;
//define an array to hold our min comparison values
uint8_t minArr[8] = { bound, bound, bound, 255, bound, bound, bound, 255 };
//load the data into a NEON register
uint8x8_t regMin = vld1_u8(minArr);
 
for(unsigned int y = minY; y < maxY; y++)
{
unsigned int xOffset = y * imageWidth * 4;
for(unsigned int x = minX; x < maxX; x += 8)
{
//load pixel data of two adjacent pixels into NEON register
uint8_t* pixel = (uint8_t*)&sourcePixels[xOffset + x];
uint8x8_t regPixel = vld1_u8(pixel);
//perform min comparison into pixel register
regPixel = vmin_u8(regPixel, regMin);
//perform multiplication into pixel register
regPixel = vmul_u8(regPixel, regMult);
//store pixel data of two adjacent pixels into targetPixels
pixel = (uint8_t*)&targetPixels[xOffset + x];
vst1_u8(pixel, regPixel);
}
}
}

We are now fully utilizing our 64bit vectors during the computation. However, we also have 128bit vectors available and enough data to fill them. So if we unroll our loop by a factor of 4 instead of 2, we can load 16 8bit values at once, thereby processing 4 pixels at a time. The code for doing so is given below.

void MultiplyEffectInlinedUnrolledOptimizedDeviating128::Process(Windows::Foundation::Rect rect)
{
unsigned int sourceLength, targetLength;
byte* sourcePixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);
byte* targetPixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);
 
unsigned int minX = (unsigned int)rect.X * 4;
unsigned int minY = (unsigned int)rect.Y;
unsigned int maxX = minX + (unsigned int)rect.Width * 4;
unsigned int maxY = minY + (unsigned int)rect.Height;
 
//define an array to hold our multipliers
uint8_t multArr[16] = { multiplier, multiplier, multiplier, 1, multiplier, multiplier, multiplier, 1, multiplier, multiplier, multiplier, 1, multiplier, multiplier, multiplier, 1 };
//load the data into a NEON register
uint8x16_t regMult = vld1q_u8(multArr);
 
byte bound = 255 / multiplier;
//define an array to hold our min comparison values
uint8_t minArr[16] = { bound, bound, bound, 255, bound, bound, bound, 255, bound, bound, bound, 255, bound, bound, bound, 255 };
//load the data into a NEON register
uint8x16_t regMin = vld1q_u8(minArr);
 
for(unsigned int y = minY; y < maxY; y++)
{
unsigned int xOffset = y * imageWidth * 4;
for(unsigned int x = minX; x < maxX; x += 16)
{
//load pixel data of four adjacent pixels into NEON register
uint8_t* pixel = (uint8_t*)&sourcePixels[xOffset + x];
uint8x16_t regPixel = vld1q_u8(pixel);
//perform min comparison into pixel register
regPixel = vminq_u8(regPixel, regMin);
//perform multiplication into pixel register
regPixel = vmulq_u8(regPixel, regMult);
//store pixel data of four adjacent pixels into targetPixels
pixel = (uint8_t*)&targetPixels[xOffset + x];
vst1q_u8(pixel, regPixel);
}
}
}

Aside from moving to the 128bit vector operations we again have to change the increment of x; it is now 16 instead of 8. It is always worth checking this when doing loop unrolling - I forgot to do so myself while writing the sample. Now, however, it is time for the last performance comparison table on our own optimizations, to see what this deviating optimization has achieved.

Effect Lumia 1020 max Lumia 1020 min Lumia 1520 min
Managed (C#) 745 ms 708 ms 346 ms
Native (C++) 370 ms 358 ms 238 ms
Inlined (C++) 274 ms 241 ms 150 ms
Inlined NEON (C++) 381 ms 369 ms 216 ms
Inlined Unrolled (C++) 240 ms 226 ms 146 ms
Inlined Unrolled NEON (C++) 203 ms 193 ms 99 ms
Inlined Unrolled Deviating NEON 64 bit (C++) 184 ms 164 ms 70 ms
Inlined Unrolled Deviating NEON 128 bit (C++) 125 ms 111 ms 51 ms

As you can see, by sacrificing a little accuracy our last NEON implementation is now around twice as fast as the pure C++ implementation. Given how small the deviations from the reference rendering are, the gains in performance might very well be worth the tradeoff (the deviations page shows a maximum deviation of 1, with a far smaller mean deviation, when using the multiplication factor 2). Some more information on optimizations that create a slight variation from the reference result while computing quite a bit faster can be found in the optimization section of this article: Image processing optimization techniques.

Compiler Optimizations

As was already mentioned when introducing Function Inlining, many of the techniques we use to identify potential for parallelizing code are actually employed by optimizing compilers themselves to make the code as fast as possible. So how come we see performance differences between our non-inlined and inlined C++ code? The answer is easy - all comparisons so far were done using Debug code, which is not only slower overall due to the debugging symbols but also effectively keeps the compiler from doing optimizations itself. The reason for doing this was to showcase what effect those changes can have on their own, given that for more complex code the compiler might not be able to do e.g. loop unrolling (especially on nested loops), which a programmer can still do himself.

So let's have a look at how our samples perform when compiled in the release configuration.

Effect Lumia 1020 debug Lumia 1020 release Lumia 1520 release
Managed (C#) 708 ms 572 ms 244 ms
Native (C++) 358 ms 81 ms 45 ms
Inlined (C++) 241 ms 81 ms 45 ms
Inlined NEON (C++) 369 ms 94 ms 56 ms
Inlined Unrolled (C++) 226 ms 78 ms 45 ms
Inlined Unrolled NEON (C++) 193 ms 66 ms 38 ms
Inlined Unrolled Deviating NEON 64 bit (C++) 164 ms 62 ms 36 ms
Inlined Unrolled Deviating NEON 128 bit (C++) 111 ms 62 ms 36 ms

Aside from the fact that the release code is a lot faster (especially the native code), we can see immediately that those of our optimization steps that do not move code from C# to C++, or from pure C++ to C++/NEON, are not having much of an effect. This is due to the compiler optimizations. What you can also see is that the compiler is not able to protect us from issues like the additional memory copy we did during our first attempt at utilizing the NEON instructions. The last thing you might notice is that percentage-wise the gains seem to be much smaller in the release configuration than they were in the debug configuration. This can be explained if you keep in mind that we're not measuring our effect's execution itself but the time for the whole rendering pipeline surrounding it (mainly copying data into our source buffer and out of our target buffer). The time it takes for the SDK to perform those tasks isn't affected by our change in configuration, as we're always linking against release code of the SDK. This time is now, however, making up a bigger percentage of the time the rendering pipeline takes to execute.

When to optimize

Having read all this you might now ask yourself: when do I start optimizing my code? When do I stop? In my opinion it is best to start by implementing your custom effect in a straightforward, easy to understand way. Then you should test whether it is fast enough. If it is, I don't see much reason to put a lot of effort into optimization. However, as you have seen, it is quite easy to move your code from C# to C++ and you get quite a big boost in performance by doing so, so this is something you should always consider. Moving to SIMD instructions or applying other (perhaps lossy) optimizations, however, is something I suggest you don't do unless your use case requires it (e.g. realtime camera feed preview), because it is easy to spend considerable amounts of time on it.

Summary

Starting out with a very simple effect implemented in C# that took around 700 ms to complete, we have arrived at a solution that consistently takes less than a sixth of that time using C++ native code and ARM NEON intrinsics. Along the way we discussed how to implement effects for the Nokia Imaging SDK using C++ and how to compare different versions of a filter regarding both performance and image fidelity. We've also seen two ways to rewrite our code that make it easier to spot opportunities to use ARM NEON instructions, as well as a situation where using ARM NEON instructions required conversions expensive enough to make the actual performance of the effect worse. Continuing on that path, however, we were able to build the fastest version of our effect that gives exactly the same result as the initial effect. We then implemented a change that, while leading to a small change in the output, allowed us to almost double performance again compared to the already fast previous version. Optimization experts might be able to improve performance even more.

The performance improvements of our ARM NEON optimized code over the pure C++ code are quite nice but due to the simple nature of our effect which requires only few calculations while doing a lot of loads/stores from/to memory they don't approach the performance gains you can achieve on more complex effects and that you often see in discussions on the possible performance gains using SIMD instructions.

In the end we discussed how compiler optimization plays into this and why it is still worthwhile (and often necessary) to do optimizations yourself.

Final words

If you run into issues with the samples, find a bug or know of a technique that could be applied towards optimizing our MultiplyEffect please add a comment or extend the article accordingly.
