WP8: Optimizing your signal processing algorithms using NEON

From Nokia Developer Wiki
Jump to: navigation, search

This article explains how to optimize the performance of your signal processing algorithms, using the ARM Neon intrinsics. By spending a little bit of time manually optimizing your C++ code, you can get significant speed improvements for your image processing, audio enhancements, FFT, DCT, JPEG, FIR and IIR filters... Or alternatively, why re-invent the well when you can go for what the community has already done for you

WP Metro Icon WP8.png
SignpostIcon WP7 70px.png
Article Metadata
Tested with
Devices(s): Nokia Lumia 820, Nokia Lumia 920) -->
Platform(s): Windows Phone 8
Windows Phone 8
Windows Phone 7.5
Keywords: ARM, NEON, WP8, intrinsics, DSP, Digital signal processing, performance, assembly
Created: Mansewiz (11 Dec 2012)
Last edited: girishpadia (30 Jan 2013)

Note.pngNote: This is an "internal" entry in the Windows Phone 8 Wiki Competition 2012Q4. The author is a Nokia / Microsoft employee.



The chipset that powers the Windows Phone 8 devices is a Qualcomm Snapdragon S4. The chipset contains all the critical blocks for a smart phone, amongst other the CPU, the GPU, the memory and the modem may be of interest for the developer. Your Windows Phone application will mainly run on the CPU, while the UI components and the transitions are rendered by the GPU.

The CPU in the Snapdragon S4 is using the ARM7 instruction set. The ARM CPU architecture is very popular and today it is used in almost all the mobile phones. When writing your application in native code (C++), the compiler converts your code to an ARM7 object code. The compilers are nowadays very good and create very optimized code for the ARM.

The NEON/SIMD is one extension to the ARM7 instruction set that excels in signal processing and multimedia: it is really good with the algorithms that manipulate a lot of data and doing very repetitive operations. While the compilers will use the NEON instructions whenever it makes sense, they are not yet able to optimize signal processing algorithm as well as a human can. By doing hand optimized NEON code, you typically can achieve 2 to 4 times faster execution of your algorithms. Microsoft has recognized this, and allows you to use the Neon intrinsics in its WP8 SDK.

If your code doesn't have very tight loops with a lot of mathematical instruction, you should probably not bother of optimizing your code for Neon : the increase in speed will not be significant. But read-on, if you're doing something around audio processing, image filters, FFT, DCT, motion estimators...

What is NEON

Simplifying things a lot, one could shortly describe Neon as an extension to the ARM Core, and that extension being an ALU with 32 64-bit registers. The ALU (Arithmetical logic Unit) can perform basic arithmetic operations like add, subtract, multiply, “multiply and add” and shifts. It supports both integers and floating point calculations and is optimized for low current consumption.


While Windows Phone 7 also use an ARM core with Neon instruction set, it is not possible for developer to use that functionality. Coding using Neon intrinsic is a new functionality to Windows Phone 8.
When optimizing for NEON, the target will be to :

  1. minimize the data transfers between the Memory and the NEON registers
  2. minimize the number of operations performed in the ALU.

Let's first concentrate on the item 2.

Packed SIMD

One of the main trick that you can use to minimize the number of operations in the ALU is to use the “Packed SIMD” processing. Thanks to the packed SIMD architecture, the SW developer can use any of the Neon's 32 64- bits registers in such a way that they are considered as vectors of elements of the same data type.


Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit or single precision float. For example, 4 16-bit integers will fit in one 64 bits register. In one instruction, the ALU will perform the operation on every elements in the register. For example, the neon instruction vadd_s16 will perform 4 parallel additions of 16 bit signed integers:

// vin1 and vin2
// are both of type “4 packed signed 16-bits integers” (int16x4_t)
int16x4_t vout = vadd_s16(vin1, vin2);

The operation can be represented by this image:


That parallelism provides a significant performance increase. It is also possible to perform operation between the elements of a register (pairwise) For example, the neon instruction vpadd_s16 will add 2 contiguous 16 bit signed integers and store the result in a 32-bit cell :

// vin1 is 4 packed signed 16-bits integers int16x4_t
int32x2_t vout = vpaddl_s16(vin);

The operation can be represented by this image:


The intrinsics

Although we are very close to the metal, the code snippets above are not written in assembly, but in C++ using the NEON intrinsics. Actually, as far as I am aware, it is not possible to code directly in assembly for the Windows Phone 8.

What that means:

  • The C++ compiler will enforce type checking, while assembly doesn't.
  • In assembly you can control which registers (D0, D1, ...D32) you use. With the intrinsics, the compiler decides for you. And it can make stupid decisions sometimes.

The intrinsics types

The name of the types follow the convention <type><size>x<number of lanes>_t. For example, int16x4_t refers to 4 integers of 16 bits, packed in a 64 bits register.

int8x8_t int8x16_t
int16x4_t int16x8_t
int32x2_t int32x4_t
int64x1_t int64x2_t
uint8x8_t uint8x16_t
uint16x4_t uint16x8_t
uint32x2_t uint32x4_t
uint64x1_t uint64x2_t
float16x4_t float16x8_t
float32x2_t float32x4_t

The intrinsics instructions set

ARM ltd documents very well it's (assembly) instructions. All NEON instructions are here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGDEGD.html

Intrinsics are not that well documented, but there is a clear one-to-one mapping between the intrinsic and the assembly instruction. The list of intrinsics and their mapping to the assembly is described here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491c/CIHJBEFE.html

Loading and storing

As seen in a previous picture, at some point you will need to load data from your application to one of the Neon register. And you will also need to store the data from Neon into your application memory. Although you can load/store each elements from the memory to the register one-by-one, it is more efficient to use the Neon dedicated functionality, with the vld1 and vst1 operations, for example the following instruction will load your array into my_vec 64-bit register:

int16_t array[4] = {1,5,9,12};
int16x4_t my_vec = vld1_s16(array);

But if you prefer/need accessing a single element, it's possible to use the val method:

	int16_t elem1 = my_vect.val[0];
int16_t elem2 = my_vect.val[1];

A typical neon intrinsic implementation

With what we have seen so far, we can get a optimize a real operation. Let's say that you have an array of 16-bits integers (int16_t), and would like to sum all the elements. You could do something like this:

int sum_array(int16_t *array, int size)
     int sum =0;
for (int i=0; i<size;i++)
sum += array[i];
return sum;

But if you want to optimize your code with Neon instruction set, you could do something like this:

#include <stdint.h>
#include <assert.h>
#include <arm_neon.h>
/* return the sum of all elements in an array. This works by calculating 4 totals (one for each lane) and adding those at the end to get the final total */
int sum_array(int16_t *array, int size)
     /* initialize the accumulator vector to zero */
     int16x4_t acc = vdup_n_s16(0);
     int32x2_t acc1;
     int64x1_t acc2;
     /* this implementation assumes the size of the array is a multiple of 4 */
     assert((size % 4) == 0);
     /* counting backwards gives better code */
     for (; size != 0; size -= 4)
          int16x4_t vec;
          /* load 4 values in parallel from the array */
          vec = vld1_s16(array);
          /* increment the array pointer to the next element */
          array += 4;
          /* add the vector to the accumulator vector */
          acc = vadd_s16(acc, vec);
      /* calculate the total */
      acc1 = vpaddl_s16(acc);
      acc2 = vpaddl_s32(acc1);
      /* return the total as an integer */
      return (int)vget_lane_s64(acc2, 0);

This code is from the ARM documentation : http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0205j/BABGHIFH.html.

Existing libraries

As the example above points out, optimize code is not necessarily easy to read and optimizing may take quite a lot of time. But, since neon intrinsics are a widely used standard, there are a lot of resources on the web of existing implement of the common signal processing algorithms. With a little of Googling, you will find NEON optimized FIR filters, IIR filters, FFT, DCT, JPEG libraries, MPEG libraries and lot more.

Version Hint

Windows Phone: [[Category:Windows Phone]]
[[Category:Windows Phone 7.5]]
[[Category:Windows Phone 8]]

Nokia Asha: [[Category:Nokia Asha]]
[[Category:Nokia Asha Platform 1.0]]

Series 40: [[Category:Series 40]]
[[Category:Series 40 1st Edition]] [[Category:Series 40 2nd Edition]]
[[Category:Series 40 3rd Edition (initial release)]] [[Category:Series 40 3rd Edition FP1]] [[Category:Series 40 3rd Edition FP2]]
[[Category:Series 40 5th Edition (initial release)]] [[Category:Series 40 5th Edition FP1]]
[[Category:Series 40 6th Edition (initial release)]] [[Category:Series 40 6th Edition FP1]] [[Category:Series 40 Developer Platform 1.0]] [[Category:Series 40 Developer Platform 1.1]] [[Category:Series 40 Developer Platform 2.0]]

Symbian: [[Category:Symbian]]
[[Category:S60 1st Edition]] [[Category:S60 2nd Edition (initial release)]] [[Category:S60 2nd Edition FP1]] [[Category:S60 2nd Edition FP2]] [[Category:S60 2nd Edition FP3]]
[[Category:S60 3rd Edition (initial release)]] [[Category:S60 3rd Edition FP1]] [[Category:S60 3rd Edition FP2]]
[[Category:S60 5th Edition]]
[[Category:Symbian^3]] [[Category:Symbian Anna]] [[Category:Nokia Belle]]

This page was last modified on 30 January 2013, at 13:49.
202 page views in the last 30 days.