Floating Point and x86 Assembly – Ramblings of a Spooky-Possum

I spent most of ANZAC day playing with the floating point unit in my Pentium-M. Some notes:

If your calculation involves a NaN it slows down by about a factor of 50. Quoting Intel: “Out-of-range numbers cause very high overhead.”
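A rough way to check the NaN penalty on your own machine is a sketch like the one below. The kernel `iterate` and the wrapper `time_iterate` are made-up names for illustration, not anything from Intel's manuals. Whether you see a slowdown at all depends on the CPU and on whether the compiler emits x87 code (with gcc, `-mfpmath=387` asks for that); SSE code on a modern chip may show no difference.

```c
#include <math.h>
#include <time.h>

/* Hypothetical kernel: a dependent chain of multiply-adds.
 * For finite x the value converges to 2*x; if x is NaN,
 * every iteration has a NaN operand and the result is NaN. */
double iterate(double x, long n) {
    double acc = x;
    for (long i = 0; i < n; i++)
        acc = acc * 0.5 + x;
    return acc;
}

/* Runs iterate() once, stores its result in *result, and returns the
 * CPU time it took in seconds (resolution is whatever clock() provides). */
double time_iterate(double x, long n, double *result) {
    clock_t start = clock();
    *result = iterate(x, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_iterate(1.0, n, &r)` and `time_iterate(NAN, n, &r)` with a large `n` and comparing the two times gives a crude ratio. Use the stored result afterwards (print it, say) so the optimiser cannot throw the loop away.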
On a Pentium-M fxch is essentially free. However, according to Intel’s optimisation manual, it is very expensive on Netburst (P4) CPUs. They are very different architectures.
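Loading into a floating point register which isn’t empty generates an invalid number (a NaN, from the resulting stack overflow). The corollary of this is that fincstp, which moves the stack pointer without marking the register empty, isn’t as open to abuse as you’d hope it would be.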
Time-wise fadd/fsubr < fmul << fdiv. You already knew that, but fmul is closer to fadd than you might think.
Calculating 2xy+c as (((x+x)*y)+c) is quicker than (((x*y)+(x*y))+c) despite using exactly the same instructions, just in a different order (fadd, fmul, fadd vs. fmul, fadd, fadd). It’s a dependent chain, so there really shouldn’t be any difference. It may simply be that the former version can fetch y while (x+x) is being calculated whereas the latter has to fetch both x and y before doing the multiplication.
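For reference, here are the two orderings written out in C, as a sketch only. The function names are invented for this example, and the compiler is free to schedule the instructions differently from how the source reads, so it is worth looking at the generated assembly (e.g. `gcc -O3 -S`) before timing anything.

```c
/* Add first: (x + x) * y + c, i.e. fadd, fmul, fadd. */
double two_xy_plus_c_add_first(double x, double y, double c) {
    return (x + x) * y + c;
}

/* Multiply first: (x*y + x*y) + c, i.e. fmul, fadd, fadd. */
double two_xy_plus_c_mul_first(double x, double y, double c) {
    double xy = x * y;
    return (xy + xy) + c;
}
```

Doubling a floating point value is exact (barring overflow), so both functions should return the same number for ordinary inputs; any difference between them is one of timing, not of result.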
The stack-based architecture of the FPU might have made sense for a separate co-processor, but it is awful in a super-scalar architecture. Still, gcc generally seems to generate faster code with it than when it uses the more modern SSE registers. Hand assembly still wins by about 30%: gcc seems to like gratuitous memory accesses (probably some strange requirement of the C language).
Architecture-specific compilation, i.e. adding -march=pentium-m, has a significant speed advantage over generic -O3.