Go to this website:
http://cadal.cse.nsysu.edu.tw/
Click on publications and search for the following article:
"Cost-Effective Microarchitecture Optimization of the ARM7TDMI Microprocessor"
It explains the ARM chip's hardware pipeline in detail, and understanding that pipeline is crucial for writing fast code.
Again, this is not for beginners, but it is a cool document.
Alex