Talk:Creating and optimizing a custom effect for the Nokia Imaging SDK
Really amazing. I don't think using debug builds to compare the tricks is a good idea. You can get really different results. I think you should use only release results.
Explaining optimization tricks is a really good idea.
- inline C++: the only point I think is unnecessary. It's the compiler's job. You could use template functions too.
- you can unroll loop code with templates
- can you use aligned memory with ARM NEON?
- you should test with parallel_for: http://msdn.microsoft.com/library/dd470426.aspx#parallel_for
talk) 22:17, 14 December 2013 (EET)(
SB Dev -
Thanks for the feedback. It's very valid criticism and I'll look into improving the article in the mentioned ways.
As for inlining - you're right that it's something that can be left to the compiler if your goal is code optimization by using inlining itself. However we are using manual inlining and loop unrolling to get the source code into a format that makes it easier to rewrite to take advantage of SIMD/Neon instructions. Perhaps I can make it clearer inside the article somehow that those steps mainly serve to help us in doing so before people start spending time doing optimization steps in C++ that are done by the compiler anyway. Maybe I should remove the performance comparisons for those code samples - they're mainly in there right now to show people the issue with the first naive attempt at using SIMD instructions (that actually hurt performance).
Parallel_For might be a good fit for the outer loop. Do you have any experience with whether it interferes with the rendering pipeline in some way, given that the pipeline runs asynchronously? Do you by chance have an example of doing loop unrolling with templates?
talk) 12:58, 15 December 2013 (EET)(
I don't see why parallel_for could interfere with async. Look here: http://msdn.microsoft.com/en-us/library/hh750082.aspx#example_component
They use parallel_for with create_async. I don't know if you can choose the granularity.
Normally, you can use it with SIMD, but I haven't tested that. About unrolling code with templates, I found an example here: http://www.di.unipi.it/~nids/teaching/files/TMP_handout.pdf. I could make a sample (next week) if you want.
talk) 13:17, 15 December 2013 (EET)(
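For readers following this thread, here is a minimal sketch of what loop unrolling with templates can look like. It is not taken from the article or from the linked handout; `Unroll`, `halve_bytes` and the "halve every byte" effect are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Recursive template: Unroll<N>::run(f, i) expands at compile time into
// f(i); f(i + 1); ... f(i + N - 1); so no inner loop counter is needed.
template <std::size_t N>
struct Unroll {
    template <typename F>
    static void run(F&& f, std::size_t base) {
        f(base);
        Unroll<N - 1>::run(f, base + 1);
    }
};

// Recursion anchor: zero remaining steps, nothing to do.
template <>
struct Unroll<0> {
    template <typename F>
    static void run(F&&, std::size_t) {}
};

// Hypothetical effect: halve every byte of a buffer, handling four bytes
// (one 32-bit pixel) per iteration of the outer loop.
void halve_bytes(std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    constexpr std::size_t kStep = 4;
    std::size_t i = 0;
    for (; i + kStep <= size; i += kStep) {
        Unroll<kStep>::run(
            [data](std::size_t k) {
                data[k] = static_cast<std::uint8_t>(data[k] >> 1);
            },
            i);
    }
    // Tail: bytes left over when size is not a multiple of kStep.
    for (; i < size; ++i) {
        data[i] = static_cast<std::uint8_t>(data[i] >> 1);
    }
}
```

Whether this beats a plain loop depends on the compiler, which may already unroll the loop on its own in release builds. The main benefit is that the unroll factor becomes a single compile-time constant.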
SB Dev -
I'll look into it - I'm currently out of town and I guess I'll need some time to really digest the info. Thanks for the links :) Another idea I want to try out soon is to switch to strided memory accesses using VLDn to avoid having to keep the alpha values in there and multiplying them by the neutral element (1).
talk) 14:22, 15 December 2013 (EET)(
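A rough sketch of the strided-access idea, for anyone who wants to experiment with it. It assumes 4-byte pixels with alpha in the fourth byte and uses a hypothetical saturating "add brightness" effect rather than the article's effect; `brighten_pixels` is an illustrative name. The NEON path uses the `vld4q_u8`/`vst4q_u8` intrinsics (which map to VLD4/VST4) and is only compiled when `__ARM_NEON` is defined. Elsewhere the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical effect: add `delta` to the three colour bytes of every
// 4-byte pixel with saturation; the fourth byte (alpha) is left untouched.
void brighten_pixels(std::uint8_t* pixels, std::size_t pixelCount,
                     std::uint8_t delta) {
    assert(pixels != nullptr || pixelCount == 0);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t d = vdupq_n_u8(delta);
    for (; i + 16 <= pixelCount; i += 16) {
        // VLD4 de-interleaves 16 pixels: val[0]..val[3] each hold the
        // 16 values of one channel.
        uint8x16x4_t px = vld4q_u8(pixels + 4 * i);
        px.val[0] = vqaddq_u8(px.val[0], d);  // saturating add
        px.val[1] = vqaddq_u8(px.val[1], d);
        px.val[2] = vqaddq_u8(px.val[2], d);
        // px.val[3] (alpha) is written back unchanged, so no multiply
        // by a neutral element is needed.
        vst4q_u8(pixels + 4 * i, px);
    }
#endif
    // Scalar fallback, also used for the pixels left over after the
    // 16-pixel blocks.
    for (; i < pixelCount; ++i) {
        for (int c = 0; c < 3; ++c) {
            const unsigned v = static_cast<unsigned>(pixels[4 * i + c]) + delta;
            pixels[4 * i + c] = static_cast<std::uint8_t>(v > 255u ? 255u : v);
        }
    }
}
```

The design point is that the alpha channel ends up in its own register after the load and is skipped entirely, instead of travelling through the arithmetic with a neutral operand.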
I tested a few things and found incredible news!
- WP supports OpenMP: http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
- Auto-Parallelization and Auto-Vectorization are activated: http://msdn.microsoft.com/en-us/library/hh872235(v=vs.110).aspx and
The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions in your computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors.
- AMP is not supported (yet?)
talk) 16:00, 15 December 2013 (EET)(
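To make the two points above concrete, here is a small sketch, assuming OpenMP is available as reported in this comment. The "invert every byte" effect and the name `invert_image` are hypothetical. The outer loop over rows carries the OpenMP pragma and the inner loop is kept simple enough to be a candidate for the auto-vectorizer. If OpenMP is not enabled in the build (`/openmp` for MSVC, `-fopenmp` for GCC/Clang), the pragma is ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical effect: invert every byte of a width x height image that
// stores 4 bytes per pixel in one contiguous buffer.
void invert_image(std::uint8_t* pixels, int width, int height) {
    assert(width >= 0 && height >= 0);
    assert(pixels != nullptr || width == 0 || height == 0);
    const int rowBytes = width * 4;

    // Rows do not depend on each other, so the outer loop can be split
    // across cores. A signed int index keeps the loop valid for older
    // OpenMP versions.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        std::uint8_t* row = pixels + static_cast<std::ptrdiff_t>(y) * rowBytes;
        // Counted loop, no branches, no dependency between iterations:
        // the kind of loop an auto-vectorizer can usually handle.
        for (int x = 0; x < rowBytes; ++x) {
            row[x] = static_cast<std::uint8_t>(255 - row[x]);
        }
    }
}
```

Whether the inner loop is actually vectorized should be checked in the compiler's vectorization report for a release build rather than assumed.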
Hamishwillee - A bit more feedback
I got some more feedback from another reviewer: "- Doubtless great article, but when I (who is not right audience) read it, it is quite engineer porn. So trying make it lighter to start with and to understand the essentials – to whom and why."
I agree with that. Practical advice was to add a graph showing the curve of performance with and without the optimizations in the introduction. Having a nice image in the introduction is almost always good advice for any article.
He also thought that it would be good to explain at what point the optimisations are worth the trade-off - i.e. when a developer should seriously consider doing this.
Lastly, he wanted a clear statement about whether this advice is generic for any C++ situation, specific to the Imaging SDK, or general to imaging.
talk) 03:13, 19 December 2013 (EET)(
Hamishwillee - There is a little on parallel_for in "Image processing optimization techniques". I don't know if it is a reasonable starting point for this.
talk) 03:32, 19 December 2013 (EET)(
SB Dev - Changes during review period/reworking of certain points
I have so far done only very light editing (mainly fixing links, etc.) since the competition end date as I was under the impression it might pose a problem for the reviewers.
I will have to rewrite quite a bit of my text (not the actual content presented, but the parts that reference the numbers for the debug configuration) to follow Yan's suggestion of doing the comparisons on release code. Compiler optimization will also have to be moved to an earlier section so that some of the manual steps make sense.
As for the when and why of optimization - that should be possible to add. The short version in my view is: build clean, easy to read code first and optimize it (quite often making it less readable in the process) if you need it to work faster. Do so until it is fast enough for what you're trying to do. If you do real time previews you will have to get down to a processing time of less than 33 ms per frame to allow for 30 fps - something that will almost certainly mean you'll have to lower the resolution and/or even move your code to the GPU.
You might also want to use lossy versions that introduce deviations for such "fast" tasks and resort to the exact implementation when the user decides to store an image.
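The 33 ms figure follows from 1000 ms / 30 frames, which is about 33.3 ms per frame. A small sketch of how one might check an effect against that budget, with hypothetical helper names (`frame_budget_ms`, `fits_frame_budget`) and a single timed run standing in for a proper benchmark:

```cpp
#include <cassert>
#include <chrono>

// Time available per frame, in milliseconds, for a target frame rate.
inline double frame_budget_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Runs `work` once and reports whether it finished within the per-frame
// budget for `fps`. A real measurement would average many runs of a
// release build on the target device.
template <typename Work>
bool fits_frame_budget(Work&& work, double fps) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() <= frame_budget_ms(fps);
}
```

The budget covers everything done per frame, so the effect itself has to come in well under it.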
As for adding a graph - I'm not sure whether it might mislead people into thinking that those are the gains they will always see when using the optimization techniques. For some algorithms it's possible to be far faster (but I've opted for a simple effect to make it easier to follow what is being done and how, while living with smaller gains). I thought of showing the effect with different input parameters but then decided against it, as the article is less about what is being done than about how to do it.
As for applicability of the information, there are currently two parts. The first is quite a bit of information on how to implement a custom effect for the Imaging SDK using C++ at all (as that is currently missing from the official documentation). We then start comparing different implementations of an effect, and that's when we actually start talking about optimization (although the move from C# to C++ is already pretty significant performance-wise). The techniques presented for doing so can be applied to any big dataset that requires doing similar calculations on a large input vector. Imaging is a classic example of this but you could use the information just as well to speed up processing sound samples (e.g. modifying the amplitude of the audio signal would be something very similar to what our effect does to an image).
This is reflected in the fact that all changes after moving to C++ only occur in the implementation of the Process method, which is only connected to the SDK by its input and output buffers.
I could definitely add some of those enhancements before my text rewrite for the "release only" numbers. The question is: should I do it now, or should I wait until the review of the competition material has finished? In the end, as always: the feedback is much appreciated, I can understand where it's coming from and I will definitely use it to improve the article.
talk) 06:17, 19 December 2013 (EET)(
Hamishwillee - Great
If it is easier to make the changes now, do it. The reviewers have mostly covered your article and those that haven't can view the original in the history.
talk) 07:49, 19 December 2013 (EET)(
BuildNokia - Edited for clarity
Congratulations on winning the Nokia Imaging and Big UI Wiki Competition with this article.
This was a really interesting set of experiments to go through. I'm sure a lot of people will be reading this article, so I've done a fairly extensive edit for clarity. When you get a chance, please take a look at my changes and roll back anything that you don't like.
There was only one spot where I intentionally changed your meaning. I changed:
This alteration will exclude the green color component from processing, keeping it the same as in the original image. Therefore only the green channel will show deviations (as well as the overall deviation).
to:
This alteration will exclude the green color component from processing, keeping it the same as in the original image. Only the blue and red channel deviations, as well as the overall deviation, will show.
Can you confirm whether my change was correct, and if not, change it back? Jen
talk) 23:53, 26 December 2013 (EET)(
SB Dev -
I didn't look at all your changes in detail but as far as I did they are definite improvements (I'm not a native speaker so quite a lot of what I write tends to sound somewhat odd).
I however had to roll back the one change you mentioned specifically. The comparer does not compare the original image to the processed image but the results of the reference effect and the comparison effect. This means that when we remove the green color component from processing, the reference effect will still process green while the comparison effect does not (leading to a deviation in the green channel but not in the red and blue ones). Perhaps we should add a mention to the text of what exactly gets compared to what, to avoid similar confusion for readers.
talk) 15:44, 27 December 2013 (EET)(
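To illustrate what gets compared to what, here is a minimal sketch of a per-channel comparison between two effect outputs. It is not the article's comparer: the names `ChannelDeviation` and `compare_outputs`, the blue/green/red/alpha byte order, and the use of summed absolute differences are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Summed absolute differences per colour channel and in total.
struct ChannelDeviation {
    std::uint64_t blue;
    std::uint64_t green;
    std::uint64_t red;
    std::uint64_t overall;
};

static std::uint64_t abs_diff(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint64_t>(a > b ? a - b : b - a);
}

// Compares the output of the reference effect with the output of the
// comparison effect. The original input image is not involved at all.
// Assumes 4 bytes per pixel in blue, green, red, alpha order.
ChannelDeviation compare_outputs(const std::uint8_t* reference,
                                 const std::uint8_t* candidate,
                                 std::size_t pixelCount) {
    assert((reference != nullptr && candidate != nullptr) || pixelCount == 0);
    ChannelDeviation d = {0, 0, 0, 0};
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const std::size_t o = 4 * i;
        d.blue += abs_diff(reference[o], candidate[o]);
        d.green += abs_diff(reference[o + 1], candidate[o + 1]);
        d.red += abs_diff(reference[o + 2], candidate[o + 2]);
    }
    d.overall = d.blue + d.green + d.red;
    return d;
}
```

If the comparison effect skips green while the reference effect processes it, only `green` and `overall` become non-zero, which matches the behaviour described above.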
BuildNokia - Ahh I understand now
Thanks for clarifying. I read it again and it makes more sense to me this time. I tried to decide if we needed to change the text so that it will be easier for other people to understand, but came to the conclusion that it already says that we're comparing the reference effect to the altered effect. So I'm happy to leave it as-is. Jen
talk) 21:37, 31 December 2013 (EET)(
Hamishwillee - ToColor/FromColor might be biggest factor in poor C# performance
Just reread Custom Filter QuickStart for Nokia Imaging SDK again. That deconstructs and then reconstructs the incoming pixels using bit shifts, and then does the same thing using To/FromColor (so the result is the same output as input). Comparing bit shift vs To/FromColor gave him roughly 18 fps vs 11 fps.
I think it would probably be more representative to compare the bit shift approach then - and also note the cost of To/FromColor.
talk) 08:36, 7 January 2014 (EET)(
SB Dev -
To/FromColor is the standard way to get at the pixel data shown as part of the Imaging SDK documentation, which is what I use as the starting point (the naive/straightforward implementation). It might be worth including the move to bit shifts as a C#-only optimization technique before moving from managed to native code. However, I don't believe that to be the sole factor in the performance difference. Out of bounds checks performed when accessing an array by index in C# are a regular issue causing slowdowns vs. native code.
Internally To/FromColor is likely to do the bit shifts as well, but it also creates and destroys Color structs instead of simply reusing byte registers, which is the likely cause of the performance difference seen in the Custom Filter QuickStart (storing data back into memory instead of doing everything in registers). Whether we'll see additional differences depends on the compiler and the underlying architecture (I don't know ARM assembler in detail - with SIMD instructions you can do something very similar). On x86 you could simply load the whole 32 bit value into a 32 bit register and access each byte separately as each 32 bit register is accessible as 4 8 bit registers as well.
I can test it out once I have both devices used in the comparisons available to me again (currently I have neither one).
talk) 14:59, 8 January 2014 (EET)(
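For reference, the bit shift approach discussed here looks roughly like this when written in C++. This is a sketch under the assumption that a pixel is a 32-bit value packed as 0xAARRGGBB; `Channels`, `unpack`, `pack` and `process_buffer` are hypothetical names, not SDK API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One pixel split into its four 8-bit channels.
struct Channels {
    std::uint8_t a;
    std::uint8_t r;
    std::uint8_t g;
    std::uint8_t b;
};

// Deconstruct a 0xAARRGGBB pixel with shifts and masks.
inline Channels unpack(std::uint32_t p) {
    Channels c;
    c.a = static_cast<std::uint8_t>((p >> 24) & 0xFFu);
    c.r = static_cast<std::uint8_t>((p >> 16) & 0xFFu);
    c.g = static_cast<std::uint8_t>((p >> 8) & 0xFFu);
    c.b = static_cast<std::uint8_t>(p & 0xFFu);
    return c;
}

// Reconstruct the 32-bit pixel from its channels.
inline std::uint32_t pack(Channels c) {
    return (static_cast<std::uint32_t>(c.a) << 24) |
           (static_cast<std::uint32_t>(c.r) << 16) |
           (static_cast<std::uint32_t>(c.g) << 8) |
           static_cast<std::uint32_t>(c.b);
}

// Deconstructs and reconstructs every pixel in place. With no change in
// between, the output equals the input, as in the quick start's test.
void process_buffer(std::uint32_t* pixels, std::size_t count) {
    assert(pixels != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        Channels c = unpack(pixels[i]);
        // An effect would modify c.r, c.g and c.b here.
        pixels[i] = pack(c);
    }
}
```

Since the channels are plain bytes in local variables, a compiler is free to keep them in registers, which is the difference from creating a Color struct per pixel that is suspected above.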
Hamishwillee - Thanks Oliver
I completely understand why you did it this way. I'm currently writing an introduction for Lumia Library which will cover custom filters drawing heavily from Rob's article - it will outline both options but prefer bit shift. The core concepts article does note that the custom filter it shows is not optimized, but does not explain why.
In any case, if you are able at some point to compare these, then that would be useful. No urgency.
talk) 08:10, 9 January 2014 (EET)(