As the main Roadmap says, the focus for the 0.9.1 cycle is performance and accurate timing (two things that sound closely related but really aren’t). There’s three main pieces of work here, not necessarily in any particular order:
1. Register allocator for the translator
Currently (on the test machines) the translated code maxes out at about 400 MIPS (millions of instructions/second) excluding everything else including memory access. (With memory access and other support it’s about half of that, although obviously these numbers depend on the exact instruction mix). We need to do better.
Plan at the moment is to implement a very simple/low-level intermediate representation, along with a linear-scan allocator – since we’re only doing basic blocks at the moment, this doesn’t require any complex analysis. Although it would be interesting to experiment (if time permits) to see if there’s a benefit to building whole functions at once.
2. SH4 cycle accuracy
By default lxdream currently runs the SH4 at 100 MIPS, with each instruction taking exactly 10ns. Considering that the real instruction rate could run anywhere from 10 MIPS to 400 MIPS, it’d be nice to be a little more accurate. I’m under no illusions that a perfectly accurate timing model is likely to be fast enough for use, but if we can get the accuracy to within about 1% of the real rate, I’ll be happy. First job here is to build a precise pipeline model for testing purposes along with a decent cache implementation, and then start looking at the best way to get a reasonably fast approximation. This is probably going to take the most time to be both correct and fast.
3. Rendering optimization
This is the biggest performance gap at the moment, and it’s a very different optimization problem. The first thing to do here is to reverse the geometry tiling performed and render the whole scene at once instead of tile-at-a-time (this pretty much doubles the entire system speed by itself, depending on the scene of course). After that it’s probably going to be the more typical tedious process of profiling, tweaking, etc. We may be able to split the rendering off into a separate thread as well, although that introduces other complexities…
Besides the above, of course there’s various smallish things that will probably be done along the way (handling translucent triangle intersections, tweaking the z-buffer, introducing some realistic I/O timings and implementing basic VMU functionality) which will fit in between the above somewhere.
- Fix MMU code not actually using the translation cache… *oops*
- Create a sorted copy of the UTLB for binary searches – MMU code is now up to around half the speed of non-MMU code (in other words, it’s 10 times faster than it was in 0.9), making dc-linux actually somewhat usable now ^_^
- PVR2: Fix incorrect frame size/width calculations
- OSX: Fix the zero (0) key not being recognized on the keyboard.