I’ve been working on “core” optimization this week (which for this purpose is everything except rendering and I/O) – aiming to get the test loads down to around 40% CPU usage on my machine (leaving the rest for rendering). By comparison 0.9 is runing at 80-100%, and that’s with the SH4 underclocked.
- Turn on -fomit-frame-pointer (for 32-bit builds). I’ve been wanting to do this for a while, but it had the slight problem of completely breaking exception handling. Fortunately there is a solution: build with -fexceptions (or one of the other flags that emit eh_frame sections) and use _Unwind_Backtrace instead of manual frame-pointer chasing.
- Enable SSE2 math for i386-linux (already enabled on all the other platforms)
- Convert the functions called from the translator to use register-passing calling conventions (regparm) – this is a decent 5-6% improvement (Note that these three all apply only to 32-bit code – the 64-bit ABI already behaves this way by default)
- OS X: Disable PIC code generation (I now discover that for some ineffable reason Apple enable it by default, unlike most platforms) – this is a about a 12% speedup by itself, which pretty much brings it back to par with the Linux version. If I’d known about this earlier…
Translated code generation:
- Remove all the ugly generated fpscr check/branch for the different FPU modes, and just check it at the start of the translation block – if it’s different from last time, flush and retranslate. Small win (about 3-4%) on FP code. (This was suggested a long time ago by dknute but I hadn’t gotten around to doing it until now).
- Implement SSE3 versions of FIPR and FTRV – the latter gives us a 4.5% improvement overall on typical rendering tests (eg 1-2% FTRVs) – that’s pretty good for tweaking one instruction.
- Optimize the store-queue write path a little bit (used fairly heavily by most apps)
I’ve also added a couple of new configure options: –disable-optimized turns off all optimizations and compiles with -g3, and –enable-profiled does (surprisingly enough) a profiling build.
Results after all of the above (on one particular test load): 32-bit OS X: 36% faster; 32-bit Linux: 21% faster; 64-bit Linux: 12% faster. 32-bit + 64-bit versions are now performing almost identically, with the 64-bit build just a hair in front.
Of course, this doesn’t directly translate into equivalently better frame-rates as we’re more limited by render performance than core speed at the moment, but every little bit still helps.