Please note that as of October 24, 2014, the Nokia Developer Wiki will no longer be accepting user contributions, including new entries, edits and comments, as we begin transitioning to our new home, in the Windows Phone Development Wiki. We plan to move over the majority of the existing entries over the next few weeks. Thanks for all your past and future contributions.

# An Optimized Color-Blindness-Simulating Custom Effect

This article explains how to build a highly-optimized custom effect to simulate red-green colorblindness in real-time.

Note: This is an entry in the Nokia Original Imaging Effect Wiki Challenge 2014Q2

## Contents |

## Introduction

A significant number of people suffer from colorblindness and have difficulty distinguishing between red and green. In this wiki article, we will build an effect in C++ to simulate this condition to help non-colorblind people understand the condition, and then optimize the effect using NEON intrinsics. High performance is the primary goal for this effect, and this wiki article will demonstrate optimizations that do not have a significant impact on this filter but may be useful for other filters.

## Source Code

The scaffolding from this project is a combination of Nokia’s real-time filter demo and the CustomEffectSample project. I’ve updated these to use the newest version of the Imaging SDK and I’ve tried to make my code easy to use as a scaffold for other native filters. The main method of the final optimized version is given below. The full file is available on GitHub here.

void ColorBlindEffectFinalNEON::Process(Windows::Foundation::Rect rect)

{

unsigned int sourceLength, targetLength;

// Get raw pointers

byte* sourcePixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(sourceBuffer, &sourceLength);

byte* targetPixels = CustomEffectNativeNeonOptimized::Helpers::GetPointerToPixelData(targetBuffer, &targetLength);

unsigned int minX = (unsigned int)rect.X * 4;

unsigned int minY = (unsigned int)rect.Y;

unsigned int maxX = minX + (unsigned int)rect.Width * 4;

unsigned int maxY = minY + (unsigned int)rect.Height;

uint16x8_t maxes = vdupq_n_u16(0x7FFF); // Largest number that when >> 7 is < 256.

unsigned int iwidth4 = imageWidth * 4;

unsigned int xOffset = minY*iwidth4;

if ((maxX - minX) % 32 != 0)

throw (-1);

if ((maxX - minX) < 64 != 0)

throw (-2);

for (unsigned int y = minY; y < maxY; y++)

{

unsigned int mminX = minX + xOffset;

unsigned int mmaxX = maxX + xOffset - 32;

uint8x8x4_t saveLast;

// Initialize pointers to save on calculation later

byte* spx = &sourcePixels[mminX];

byte* spminus = &targetPixels[mminX] - 32;

byte* spplus = spx + 32;

for (unsigned int x = mminX; x <= mmaxX; x += 32)

{

//Imaging SDK uses Blue, Green, Red, Alpha Image Format with 8 bits/channel

// Load from memory

uint8x8x4_t loaded = vld4_u8(spx);

// Expand to 16 bit integers

uint16x8_t bin = vmovl_u8(loaded.val[0]);

uint16x8_t gin = vmovl_u8(loaded.val[1]);

uint16x8_t rin = vmovl_u8(loaded.val[2]);

// Do all the multiplications

uint16x8_t p3 = vmulq_n_u16(bin, 14);

uint16x8_t p1 = vmulq_n_u16(gin, 90);

uint16x8_t p4 = vmulq_n_u16(gin, 75);

uint16x8_t p2 = vmulq_n_u16(rin, 46);

uint16x8_t p5 = vmulq_n_u16(rin, 38);

// Save the result from the previous iteration

if (x > mminX)

vst4_u8(spminus, saveLast);

// Do the additions to get the final result

uint16x8_t destr = vaddq_u16(p3, p4);

uint16x8_t destg = vaddq_u16(p1, p2);

destr = vaddq_u16(destr, p5);

// Cap the values at 2^15-1

destg = vminq_u16(destg, maxes);

destr = vminq_u16(destr, maxes);

// Prefetch the result for the next iteration (makes loading a lot faster)

if (x < mmaxX)

__prefetch(spplus);

// Divide by 2^7 by shifting right 7 bits

destg = vshrq_n_u16(destg, 7);

destr = vshrq_n_u16(destr, 7);

// Compact to 8 bit integers

loaded.val[1] = vmovn_u16(destg);

loaded.val[2] = vmovn_u16(destr);

// Update pointers

spx += 32;

spplus += 32;

spminus += 32;

// Store the results in a register so that they can be stored to main memory later

saveLast = loaded;

}

// Store the last result

vst4_u8(spminus, saveLast);

xOffset += iwidth4;

}

}

## Performance

Device | Version 1 | Version 2 | Version 3 | Version 4 | Version 5 | Version 6 | Version 7 |
---|---|---|---|---|---|---|---|

Lumia 925 | 3-5 FPS | 5 FPS | 5-6 FPS | 6 FPS | 10-11 FPS | 11 FPS | 11-12 FPS |

## Using the Effect

To use the filter in a C# application, use the DelegatingEffect class. For example:

Note that this version uses NEON intrinsics, so it won't run on the emulator. Use CustomEffectNative.ColorblindEffect16bit if you need to run on the emulator.

You can then use customEffect like any other effect in the Imaging SDK.

On the right sidebar, an example of the filter's usage is shown. With the filter applied (bottom images), you see the images as someone with colorblindness would. Note that the numbers you can see (and can't see) are different in the two images, which is how the tests determine if someone is colorblind.

## Optimization Walkthrough

We can simulate color-blindness (i.e. the inability to distinguish red-green) by: Converting the image to the YCbCr color space Deleting all red-green information Then converting back to RGB. Wikipedia gives these formulas from the JPEG specification for steps 1 and 3:

- Y = .299*r + .587*g + .114*b
- c
_{b}= 128 - .168736*r - .331264*g + .5*b - c
_{r}= 128

- R = Y + 1.402*(c
_{r}- 128); - G = Y - .34414*(c
_{b}- 128) - .71414*(c_{r}- 128); - B = Y + 1.772*(c
_{b}- 128);

Deleting the red-green information simply consists of setting C_{r} to 128.
Thus, a simple implementations is:

double b = sourcePixels[xOffset + x];

double g = sourcePixels[xOffset + x + 1];

double r = sourcePixels[xOffset + x + 2];

byte a = sourcePixels[xOffset + x + 3];

double Y = .299*r + .587*g + .114*b;

double cb = 128 - .168736*r - .331264*g + .5*b;

double cr = 128;

double newR = Y + 1.402*(cr - 128);

double newG = Y - .34414*(cb - 128) - .71414*(cr - 128);

double newB = Y + 1.772 * (cb - 128);

if (newR > 255) newR = 255;

if (newR < 0) newR = 0;

if (newG > 255) newG = 255;

if (newG < 0) newG = 0;

if (newB > 255) newB = 255;

if (newB < 0) newB = 0;

targetPixels[xOffset + x] = (byte) newB;

targetPixels[xOffset + x + 1] = (byte) newG;

targetPixels[xOffset + x + 2] = (byte) newR;

targetPixels[xOffset + x + 3] = a;

**3-5 FPS**

The first optimization we can make is to mathematically combine the formulas, by plugging the formulas for Y’ and C_{b} and 128 for C_{r} into the formulas for R, G, B. This skips the intermediate step of converting to YCbCr, which is not really needed. Our formulas become:

- R=0.114 b + 0.587 g + 0.299 r
- G=-0.05807 b + 0.701001 g + 0.357069 r
- B=b

And an implementation:

double b = sourcePixels[xOffset + x];

double g = sourcePixels[xOffset + x + 1];

double r = sourcePixels[xOffset + x + 2];

byte a = sourcePixels[xOffset + x + 3];

double newR = 0.114*b + 0.587*g + 0.299*r;

double newG = -0.05807*b + 0.701001*g + 0.357069*r;

double newB = b;

if (newR > 255) newR = 255;

if (newR < 0) newR = 0;

if (newG > 255) newG = 255;

if (newG < 0) newG = 0;

if (newB > 255) newB = 255;

if (newB < 0) newB = 0;

targetPixels[xOffset + x] = (byte) newB;

targetPixels[xOffset + x + 1] = (byte) newG;

targetPixels[xOffset + x + 2] = (byte) newR;

targetPixels[xOffset + x + 3] = a;

**5 FPS**

We want better performance, so let’s see if we can use integer arithmetic instead. With integer arithmetic, we can’t multiply by fractional constants, so instead of computing R, G, and B directly, we’ll compute m*R, m*g, m*B and then divide by m at the end. Division is normally computationally expensive, so we’ll set m=2^n, so that dividing becomes a right bitshift. With n=8, the formulas are:

- R = 29 b + 150 g + 77 r
- G = -15 b + 179 g + 91 r
- B=b

Implemented:

int b = sourcePixels[xOffset + x];

int g = sourcePixels[xOffset + x + 1];

int r = sourcePixels[xOffset + x + 2];

byte a = sourcePixels[xOffset + x + 3];

int newR = 29 * b + 150 * g + 77 * r;

int newG = -15 * b + 179 * g + 91 * r;

if (newR > MAX) newR = MAX;

if (newG > MAX) newG = MAX;

if (newG < 0) newG = 0;

targetPixels[xOffset + x] = (byte) b;

targetPixels[xOffset + x + 1] = (byte) (newG >> 8);

targetPixels[xOffset + x + 2] = (byte) (newR >> 8);

targetPixels[xOffset + x + 3] = a;

**5-6 FPS**

The next optimization we’ll make is to switch to 16 bit integers. Unfortunately, now we have to be really careful of overflow. If we use signed 16 bit integers (int16_t), then we must be between -2^15 and 2^15-1 at every intermediate step for any input pixel. With these equations, it’s not possible. For example (r=255, g=255, b=0) produces a result of (57885,68850,0). 68850 is significantly greater than 2^16, so this causes overflow. We could try checking and correcting after every operation, but that would probably be slower than the 32 bit integer version. Instead, we are going to sacrifice a little accuracy and let n=7 and use unsigned integers. This lets us calculate (r=255, g=255, b=0) correctly, but that -15 in the formula for G becomes a big problem when trying to calculate (r=0, g=0, b=255). The easiest solution is to turn that -15 into a 0. Again, this sacrifices some accuracy but prevents us from ever overflowing.

byte b = sourcePixels[xOffset + x];

byte g = sourcePixels[xOffset + x + 1];

byte r = sourcePixels[xOffset + x + 2];

byte a = sourcePixels[xOffset + x + 3];

uint16_t newR = 14 * b + 75 * g + 38 * r;

uint16_t newG = 90 * g + 46 * r;

if (newR > MAX) newR = MAX;

if (newG > MAX) newG = MAX;

targetPixels[xOffset + x] = b;

targetPixels[xOffset + x + 1] = (byte)(newG >> 7);

targetPixels[xOffset + x + 2] = (byte)(newR >> 7);

targetPixels[xOffset + x + 3] = a;

**6 FPS**

The next optimization is the easiest one so far: Switch from debug mode to release mode and launch without the debugger (Ctrl + F5).
This is about as far as we can go with portable C++ code. The next optimizations use NEON intrinsics, which mean they are specific to ARM chips and won’t run in the emulator.
We will be using a strategy similar to the one in this wiki article.
The first step is to load the data from memory using the de-interleave operation (vld4_u8). Then convert from 8 bit integers to 16 bit integers, perform the multiplications (vmulq) and additions (vaddq) we need. Finally, we enforce the max value (vminq), shift right by 7 bits (vshrq), convert back to 8 bit integers (vmovn) and save (vst4). The q suffix indicates that the intrinsic operates on 128 bits (8x 16-bit integers) at a time. The u16 suffix stands for unsigned 16-bit integer.

uint16x8_t maxes = vdupq_n_u16(0x7FFF);

unsigned int iwidth4 = imageWidth * 4;

unsigned int xOffset = minY*iwidth4;

if ((maxX - minX) % 32 != 0)

throw (-1);

if ((maxX - minX) < 64 != 0)

throw (-2);

for (unsigned int y = minY; y < maxY; y++)

{

for (unsigned int x = minX; x < maxX; x += 32)

{

//Imaging SDK uses Blue, Green, Red, Alpha Image Format with 8 bits/channel

// Load from memory

uint8x8x4_t loaded = vld4_u8(&sourcePixels[xOffset + x]);

// Expand to 16 bit integers

uint16x8_t bin = vmovl_u8(loaded.val[0]);

uint16x8_t gin = vmovl_u8(loaded.val[1]);

uint16x8_t rin = vmovl_u8(loaded.val[2]);

// Do all the multiplications

uint16x8_t p3 = vmulq_n_u16(bin, 14);

uint16x8_t p1 = vmulq_n_u16(gin, 90);

uint16x8_t p4 = vmulq_n_u16(gin, 75);

uint16x8_t p2 = vmulq_n_u16(rin, 46);

uint16x8_t p5 = vmulq_n_u16(rin, 38);

// Do the additions to get the final result

uint16x8_t destr = vaddq_u16(p3, p4);

uint16x8_t destg = vaddq_u16(p1, p2);

destr = vaddq_u16(destr, p5);

// Cap the values at 2^15-1

destg = vminq_u16(destg, maxes);

destr = vminq_u16(destr, maxes);

// Divide by 2^7 by shifting right 7 bits

destg = vshrq_n_u16(destg, 7);

destr = vshrq_n_u16(destr, 7);

// Compact to 8 bit integers

loaded.val[1] = vmovn_u16(destg);

loaded.val[2] = vmovn_u16(destr);

// Store the results

vst4_u8(&targetPixels[xOffset + x], loaded);

}

xOffset += iwidth4;

}

**10-11 FPS**

In order to better understand what further changes we can make to improve the performance, we will profile the code. From the Debug menu, hit Start Windows Phone Application Analysis. Select the Execution option, then let the app stay on the page you want to profile for 30 - 45 seconds before hitting End Session. After the report is generated, you should see the function under Functions Doing Most Individual Work. Clicking on it shows which lines are taking the most time.

To understand why these lines take so much time, we need to know a few more details about the ARM NEON architecture:

- Passing data from the normal memory to NEON is slow, but passing data from NEON back to main memory is even slower.
- NEON instructions are pipelined. For example, whenever possible, you should avoid using a result immediately after computing it.

In light of this and the profiling results, we can rearrange some of the instructions. In particular, we can offset the load and store so that they can complete when in the background and not right when the results need to be used.

uint16x8_t maxes = vdupq_n_u16(0x7FFF);

unsigned int iwidth4 = imageWidth * 4;

unsigned int xOffset = minY*iwidth4;

if ((maxX - minX) % 32 != 0)

throw (-1);

if ((maxX - minX) < 64 != 0)

throw (-2);

for (unsigned int y = minY; y < maxY; y++)

{

uint8x8x4_t saveLast;

for (unsigned int x = minX; x < maxX; x += 32)

{

//Imaging SDK uses Blue, Green, Red, Alpha Image Format with 8 bits/channel

// Load from memory

uint8x8x4_t loaded = vld4_u8(&sourcePixels[xOffset + x]);

// Expand to 16 bit integers

uint16x8_t bin = vmovl_u8(loaded.val[0]);

uint16x8_t gin = vmovl_u8(loaded.val[1]);

uint16x8_t rin = vmovl_u8(loaded.val[2]);

// Do all the multiplications

uint16x8_t p3 = vmulq_n_u16(bin, 14);

uint16x8_t p1 = vmulq_n_u16(gin, 90);

uint16x8_t p4 = vmulq_n_u16(gin, 75);

uint16x8_t p2 = vmulq_n_u16(rin, 46);

uint16x8_t p5 = vmulq_n_u16(rin, 38);

// Save the result from the previous iteration

if (x > minX) // ... if this isn't the first one

vst4_u8(&targetPixels[xOffset + x - 32], saveLast);

// Do the additions to get the final result

uint16x8_t destr = vaddq_u16(p3, p4);

uint16x8_t destg = vaddq_u16(p1, p2);

destr = vaddq_u16(destr, p5);

// Cap the values at 2^15-1

destg = vminq_u16(destg, maxes);

destr = vminq_u16(destr, maxes);

// Prefetch the result for the next iteration (makes loading a lot faster)

if (x < maxX) // if this isn't the last one

__prefetch(&sourcePixels[xOffset + x + 32]);

// Divide by 2^7 by shifting right 7 bits

destg = vshrq_n_u16(destg, 7);

destr = vshrq_n_u16(destr, 7);

// Compact to 8 bit integers

loaded.val[1] = vmovn_u16(destg);

loaded.val[2] = vmovn_u16(destr);

// Store the results in a register so that they can be stored to main memory later

saveLast = loaded;

}

// Store the last result

vst4_u8(&targetPixels[xOffset + maxX], saveLast);

xOffset += iwidth4;

}

**11-12 FPS**

The final optimization strategy this article will introduce is analyzing the disassembly to remove any unnecessary instructions. Place a breakpoint right before the first loop and run the code (F5). When the breakpoint is hit, open the disassembly window from Debug->Windows->Disassembly or with Ctrl-Alt-D. The disassembly is almost exactly what we’d expect, except we notice that it takes a few instructions to calculate the address for the load and store. We can precompute this and eliminate those instructions. We also notice a store to memory of loaded[0] and loaded[1] prior to multiplication, even though this value is never used in memory. As far as I can tell, this is either to aid debugging or a compiler bug in which it is not able to identify that the memory is never used.
This brings us to the final code listed in the source code section.

## Conclusion

In this article, we built and optimized a custom effect that simulates colorblindness. You can use this effect to identify if your website and apps are accessible to the colorblind population. Try taking a colorblindness test with and without the app!

## References and Other Useful Links

- https://github.com/nokia-developer/real-time-filter-demo
- http://developer.nokia.com/community/wiki/Creating_and_optimizing_a_custom_effect_for_the_Nokia_Imaging_SDK
- http://developer.nokia.com/community/wiki/Image_processing_optimization_techniques
- http://developer.nokia.com/community/wiki/WP8:_Optimizing_your_signal_processing_algorithms_using_NEON
- http://community.arm.com/groups/processors/blog/2010/06/28/coding-for-neon--part-3-matrix-multiplication
- http://infocenter.arm.com/help/topic/com.arm.doc.dui0491i/Badcdfad.html
- http://infocenter.arm.com/help/topic/com.arm.doc.dui0489i/Bcfjicfj.html
- http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Babfjcjb.html

(no comments yet)