Archive for the ‘Development’ Category

November 7th, 2008 by nkeynes
The Road to 0.9.1
Posted in Development

As the main Roadmap says, the focus for the 0.9.1 cycle is performance and accurate timing (two things that sound closely related but really aren’t). There are three main pieces of work here, in no particular order:

1. Register allocator for the translator

Currently (on the test machines) the translated code maxes out at about 400 MIPS (millions of instructions per second) when everything else, including memory access, is excluded. With memory access and other support it’s about half of that, although obviously these numbers depend on the exact instruction mix. We need to do better.

The plan at the moment is to implement a very simple/low-level intermediate representation, along with a linear-scan allocator – since we’re only doing basic blocks at the moment, this doesn’t require any complex analysis. It would also be interesting to experiment (if time permits) to see whether there’s a benefit to building whole functions at once.
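For readers who haven't met linear-scan before, here is a minimal sketch of the idea restricted to a single basic block. It is not the lxdream allocator: the names (`struct interval`, `linear_scan`, `NUM_HOST_REGS`) are hypothetical, and it assumes the live intervals have already been computed and sorted by start position.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_HOST_REGS 3   /* hypothetical number of allocatable host registers */
#define SPILLED (-1)

/* Live range of one virtual register inside a basic block.
 * start/end are instruction indexes; reg is filled in by the allocator. */
struct interval {
    int start;
    int end;
    int reg;   /* host register number, or SPILLED */
};

/* Linear scan over intervals sorted by increasing start.
 * Returns the number of intervals that ended up spilled. */
int linear_scan(struct interval *iv, size_t n)
{
    struct interval *active[NUM_HOST_REGS]; /* interval currently holding each register */
    int spills = 0;

    for (int r = 0; r < NUM_HOST_REGS; r++)
        active[r] = NULL;

    for (size_t i = 0; i < n; i++) {
        int free_reg = SPILLED;
        int furthest = -1;   /* register whose current interval ends last */

        for (int r = 0; r < NUM_HOST_REGS; r++) {
            /* Expire intervals that ended before this one starts. */
            if (active[r] != NULL && active[r]->end < iv[i].start)
                active[r] = NULL;
            if (active[r] == NULL) {
                if (free_reg == SPILLED)
                    free_reg = r;
            } else if (furthest < 0 || active[r]->end > active[furthest]->end) {
                furthest = r;
            }
        }

        if (free_reg != SPILLED) {
            iv[i].reg = free_reg;
            active[free_reg] = &iv[i];
        } else {
            assert(furthest >= 0);   /* every register is occupied here */
            if (active[furthest]->end > iv[i].end) {
                /* Spill the longest-lived interval and take over its register. */
                iv[i].reg = furthest;
                active[furthest]->reg = SPILLED;
                active[furthest] = &iv[i];
            } else {
                iv[i].reg = SPILLED;
            }
            spills++;
        }
    }
    return spills;
}
```

With three host registers, the intervals [0,10], [1,3], [2,9], [4,5] all fit, because the last one reuses the register freed when [1,3] expires. Because everything happens within one block, a single forward pass is enough and no control-flow analysis is needed.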

2. SH4 cycle accuracy

By default lxdream currently runs the SH4 at 100 MIPS, with each instruction taking exactly 10ns. Considering that the real instruction rate could run anywhere from 10 MIPS to 400 MIPS, it’d be nice to be a little more accurate. I’m under no illusions that a perfectly accurate timing model is likely to be fast enough for use, but if we can get the accuracy to within about 1% of the real rate, I’ll be happy. First job here is to build a precise pipeline model for testing purposes along with a decent cache implementation, and then start looking at the best way to get a reasonably fast approximation. This is probably going to take the most time to be both correct and fast.
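To put numbers on the fixed-rate model described above, here is a small sketch of the arithmetic. The function names are made up for illustration; the only inputs taken from the post are the 100 MIPS default and the 10 to 400 MIPS range.

```c
#include <assert.h>
#include <stdint.h>

/* Time per instruction in picoseconds for a given rate in MIPS.
 * 1e12 ps per second divided by (mips * 1e6) instructions per second. */
uint64_t ps_per_instruction(uint32_t mips)
{
    assert(mips > 0);
    return 1000000u / mips;
}

/* Emulated time consumed by `count` instructions under a fixed-rate model. */
uint64_t elapsed_ps(uint64_t count, uint32_t mips)
{
    return count * ps_per_instruction(mips);
}
```

At the default 100 MIPS, 1000 instructions are charged 10 microseconds of emulated time. If the real hardware would have run that stretch at 400 MIPS, it would have taken 2.5 microseconds, so the fixed model is off by a factor of four for that code.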

3. Rendering optimization

This is the biggest performance gap at the moment, and it’s a very different optimization problem. The first thing to do here is to reverse the geometry tiling performed and render the whole scene at once instead of tile-at-a-time (this pretty much doubles the entire system speed by itself, depending on the scene of course). After that it’s probably going to be the more typical tedious process of profiling, tweaking, etc. We may be able to split the rendering off into a separate thread as well, although that introduces other complexities…

Besides the above, of course there are various smallish things that will probably be done along the way (handling translucent triangle intersections, tweaking the z-buffer, introducing some realistic I/O timings and implementing basic VMU functionality) which will fit in between the above somewhere.

Changes

  • Fix MMU code not actually using the translation cache… *oops*
  • Create a sorted copy of the UTLB for binary searches – MMU code is now up to around half the speed of non-MMU code (in other words, it’s 10 times faster than it was in 0.9), making dc-linux actually somewhat usable now ^_^
  • PVR2: Fix incorrect frame size/width calculations
  • OSX: Fix the zero (0) key not being recognized on the keyboard.
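The sorted-UTLB change in the list above can be sketched roughly as follows. This is an illustration only, not the lxdream code: the struct layout and function names are hypothetical, ASID matching is ignored, and it assumes the valid entries do not overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One valid TLB mapping (hypothetical layout). */
struct tlb_entry {
    uint32_t vstart;  /* first virtual address covered by the page */
    uint32_t size;    /* page size in bytes */
    uint32_t pstart;  /* physical address that vstart maps to */
};

static int cmp_vstart(const void *a, const void *b)
{
    const struct tlb_entry *x = a, *y = b;
    return (x->vstart > y->vstart) - (x->vstart < y->vstart);
}

/* Sort a copy of the valid entries by virtual start address. */
void utlb_sort(struct tlb_entry *sorted, size_t n)
{
    qsort(sorted, n, sizeof *sorted, cmp_vstart);
}

/* Binary search for the page containing vaddr.
 * Returns 1 and stores the physical address on a hit, 0 on a miss. */
int utlb_lookup(const struct tlb_entry *sorted, size_t n,
                uint32_t vaddr, uint32_t *paddr)
{
    size_t lo = 0, hi = n;

    assert(paddr != NULL);
    /* Find the first entry whose start lies above vaddr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid].vstart <= vaddr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;                       /* vaddr is below every page */

    const struct tlb_entry *e = &sorted[lo - 1];
    if (vaddr - e->vstart >= e->size)
        return 0;                       /* falls in a gap between pages */
    *paddr = e->pstart + (vaddr - e->vstart);
    return 1;
}
```

The sorted copy has to be rebuilt whenever the guest rewrites a TLB entry, which is the trade-off for a logarithmic lookup in place of a linear scan of all entries.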
October 31st, 2008 by nkeynes
Core Optimization
Posted in Development

I’ve been working on “core” optimization this week (which for this purpose is everything except rendering and I/O) – aiming to get the test loads down to around 40% CPU usage on my machine (leaving the rest for rendering). By comparison 0.9 is running at 80-100%, and that’s with the SH4 underclocked.

Compiler optimizations:

  1. Turn on -fomit-frame-pointer (for 32-bit builds). I’ve been wanting to do this for a while, but it had the slight problem of completely breaking exception handling. Fortunately there is a solution: build with -fexceptions (or one of the other flags that emit eh_frame sections) and use _Unwind_Backtrace instead of manual frame-pointer chasing.
  2. Enable SSE2 math for i386-linux (already enabled on all the other platforms)
  3. Convert the functions called from the translator to use register-passing calling conventions (regparm) – this is a decent 5-6% improvement. (Note that these three all apply only to 32-bit code – the 64-bit ABI already behaves this way by default.)
  4. OS X: Disable PIC code generation (I now discover that for some ineffable reason Apple enable it by default, unlike most platforms) – this is about a 12% speedup by itself, which pretty much brings it back to par with the Linux version. If I’d known about this earlier…
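For item 1 above, walking the stack through the unwinder instead of chasing frame pointers looks roughly like this. It is a sketch rather than the lxdream code: `struct backtrace_state`, `record_frame` and `capture_backtrace` are hypothetical names, while `_Unwind_Backtrace` and `_Unwind_GetIP` come from the `<unwind.h>` header shipped with GCC and Clang. It only sees frames that have unwind tables, which is why the build flags matter.

```c
#include <assert.h>
#include <stdint.h>
#include <unwind.h>

/* Accumulates return addresses while the unwinder walks the stack. */
struct backtrace_state {
    uintptr_t *pcs;   /* output buffer */
    int count;        /* frames recorded so far */
    int max;          /* capacity of pcs */
};

/* Callback invoked by the unwinder once per stack frame. */
static _Unwind_Reason_Code record_frame(struct _Unwind_Context *ctx, void *arg)
{
    struct backtrace_state *st = arg;

    if (st->count >= st->max)
        return _URC_END_OF_STACK;   /* any value other than NO_REASON stops the walk */
    st->pcs[st->count++] = (uintptr_t)_Unwind_GetIP(ctx);
    return _URC_NO_REASON;          /* keep walking */
}

/* Fill pcs[] with up to max code addresses, innermost frame first.
 * Returns the number of frames recorded. */
int capture_backtrace(uintptr_t *pcs, int max)
{
    struct backtrace_state st = { pcs, 0, max };

    assert(max > 0);
    _Unwind_Backtrace(record_frame, &st);
    return st.count;
}
```

An exception handler can then scan the recorded addresses for one that falls inside the translation cache, without relying on the frame pointer register being preserved.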

Translated code generation:

  1. Remove all the ugly generated fpscr check/branch for the different FPU modes, and just check it at the start of the translation block – if it’s different from last time, flush and retranslate.  Small win (about 3-4%) on FP code. (This was suggested a long time ago by dknute but I hadn’t gotten around to doing it until now).
  2. Implement SSE3 versions of FIPR and FTRV – the latter gives us a 4.5% improvement overall on typical rendering tests (e.g. 1-2% FTRVs) – that’s pretty good for tweaking one instruction.
  3. Optimize the store-queue write path a little bit (used fairly heavily by most apps)

I’ve also added a couple of new configure options: --disable-optimized turns off all optimizations and compiles with -g3, and --enable-profiled does (surprisingly enough) a profiling build.

Results after all of the above (on one particular test load): 32-bit OS X: 36% faster; 32-bit Linux: 21% faster; 64-bit Linux: 12% faster. 32-bit + 64-bit versions are now performing almost identically, with the 64-bit build just a hair in front.

Of course, this doesn’t directly translate into equivalently better frame-rates as we’re more limited by render performance than core speed at the moment, but every little bit still helps.

October 24th, 2008 by nkeynes
Week of bugs
Posted in Development

I’ve been tidying up a number of little issues that have been hanging around – nothing major but it’s good to get them out of the way. So this should be about it for the release now unless something release-critical turns up.

Changes

  • Fix assorted minor compile warnings
  • Fix save-state compatibility between 32-bit and 64-bit platforms
  • Fix save-state loading in headless mode
  • Fix make distcheck
  • Increase ALSA start buffer size (sounds much less choppy now)
  • SH4: Fix yet another flag-clobbering case. That should be all of them now *fingers crossed*
  • PVR2: Fix texcache reset breaking data structure invariants
  • PVR2: Fix FBO reuse when using more than 2 buffer sizes (crash)
  • GUI: Display an error message when unable to run rather than just disabling the button
  • GTK: Remove annoying error messages when loading save-state previews
  • OSX: Add preferences toolbar item to main window
  • OSX: Fix changes to path properties not taking effect until restart
October 16th, 2008 by nkeynes
Feature-freeze for 0.9
Posted in Development

The triangle sort improvements are in now (that being the most requested fix in the poll), which has made a big difference to many scenes. It’s still not 100% correct (and the 100% correct version needs a lot more time than I have right now), but the situations it gets wrong tend not to crop up so often in real scenes. From this point I’m just tracking down a few rendering glitches and related bugs which may or may not be fixed in time for release.

The bad news is that 0.9 will not have complete or perfect PVR2 emulation as originally envisioned. The good news is that most of the features that are still missing now are used fairly rarely, and the overall rendering quality is a huge jump over 0.8.*. And we’ll only continue to improve the quality and performance in following releases, of course.

What didn’t make the cut:

  • Accumulation buffer + bump mapping (although the error dialogs have been killed)
  • Shadow volumes for translucent polygons
  • Strip buffers

Changes

  • PVR2: Render-to-texture actually works now, with some internal simplifications
  • PVR2: Made triangle sorting algorithm approximate correctness a little more closely.
  • SH4: Fixed a few places where the T flag was accidentally being clobbered
  • OSX: Add the default config file into the bundle
October 5th, 2008 by nkeynes
Shadow volumes are in
Posted in Development

for opaque polygons, and they’re looking quite nice. And it didn’t even have as big a performance hit as I had expected. Translucent shadows need to be dealt with separately, alongside the translucent poly sorting (and will probably be quite a bit more expensive, but then again they’re more expensive on the actual PVR2 as well).

Other than that, Real Life(tm) has been quite busy recently, so progress has been a little slower than one might like, but we’re still on track for 0.9 at the moment.

Changes

  • Add opaque shadow volumes
  • Fix some punchout ordering problems
  • Add EXT_packed_depth_stencil support
September 19th, 2008 by nkeynes
New Poll: Feature requests for 0.9
Posted in Development

Now taking requests here: Poll

I’ve been looking into why Ketm (among other games) was showing some strange timing problems – turns out that they’re very sensitive to the exact SH4 CPU speed, presumably due to use of timing loops. Unfortunately the raw instruction rate, as on most modern CPUs, is quite complicated to work out, so this is going to take some time to get right. Admittedly we were going to need cycle-accurate emulation eventually anyway (or at least close-to-cycle-accurate).

Otherwise work on the accumulation buffer + shadow volumes (at least for opaque polygons) is progressing nicely, but nothing to brag about as yet. In a rare bit of good news, it appears that ATI’s latest 8.9 drivers do in fact (at least claim to) support EXT_packed_depth_stencil at long last, meaning that people with ATI cards will actually be able to see the shadow effects now. Well, once they’re working at all, anyway.

No changelog today, as I find the recent lack of commits disturbing…

September 13th, 2008 by nkeynes
Ooh look, new features
Posted in Development

You know, with all the porting and bug fixing this year, I think it’s been a while since anything new was actually added to lxdream. But it looks like we might finally be past most of that stuff, and I can start getting back on track. Well ok, the lightgun support is pretty random, but it’s good to get the input system more or less fully sorted out now.

Remaining for 0.9: Accumulation buffer, shadow volumes, and bump mapping. And a ‘real’ translucent polygon sorting algorithm, along with a bunch of minor things I need to tidy up.

Changelog

  • ASIC: Start adding the secondary PVR DMA channel
  • IDE: Implement drive status command (10h)
  • PVR2: Add vertex fog & rudimentary LUT fog (needs to be moved to the pixel shader for pixel-level correctness)
  • PVR2: Add render-to-texture support (completely untested and probably broken)
  • PVR2: Implement horizontal scaler buffer write-back
  • MAPLE: Add lightgun support
  • GUI: Allow mouse events to be used for controller buttons
  • OSX: Disable run button/menu when no program is loaded
September 2nd, 2008 by nkeynes
More bugfixes
Posted in Development

Ok, that’s about a wrap for the ‘month of bugs’. Mostly minor stuff again, but I’d be interested in hearing about anything that’s still failing to start with the latest svn revisions. And now back to some of those video bits and pieces that I keep putting off… (well, that and continuing to hammer on the ARM test cases as time permits)

Changes

  • SH4: Mask SR, FPSCR, and the MMU/General IO registers properly
  • SH4: Remove a number of harmless warnings that have been cluttering up the output
  • SH4: Initial real perf counter implementation, at least for 0x23 (SH4 clock count)
  • AICA: Set a sensible default value on the 0x2808 register
  • GDROM: Fix reading NRG images with multiple DAOX/DAOI sections
  • IDE: Generate the DMA interrupt on end of DMA rather than end of read (duh…)
  • PVR2: Adjust hclip appropriately when using the horizontal scaler
  • PVR2: Fix DMA/SQ writes to the VRAM32 region
  • GUI: Display the current disc title in the title bar
  • GUI: Change mouse grab to only take effect while actually running and while a grabby controller (i.e. the DC Mouse) is configured.
August 20th, 2008 by nkeynes
Small Bug Fixes
Posted in Development

I’ve been taking some time to tidy up a few lingering issues with the emulation, mostly small fixes but it does tend to take some time to find the bugs. As part of this, I’ve also started putting together a test harness for the ARM processor, but it’s not quite working on the actual hardware yet. (In other news, the G2 bus hates me with a rather astounding passion.)

I’ll probably keep going through these for another couple of weeks at this stage – it would be nice to get through the remaining known black-screen/hang type of bugs before I get back to mangling the renderer.
Changes

  • SH4: Fix 64-bit bugs that broke the OS X 64-bit builds
  • SH4: Fix ASID usage in ITLB lookups
  • SH4: Fix core-exit on a translation-cache flush when it occurs in a branch delay slot
  • SH4: Add the 1Cxx..1Fxx mapping for the P4 region when accessed through the TLB
  • SH4: Add version register 0xFF000030 (read-only).
  • SH4: Add undocumented performance counters (just stubbed for now). Removes the 0xFF100008 warnings, but breaks save-state compatibility.
  • ARM: Fix STM {R15} offset (should be +12, was +8)
  • ARM: Add missing half-word load/store instructions
  • ARM: Add test harness work-in-progress
  • Other: Cleanup all the -Wall compile-time warnings
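The STM {R15} fix in the list above is easy to get wrong, so here is a small sketch of an increment-after store-multiple that includes it. The function name and the word-indexed memory array are made up for the example; the only fact taken from the changelog is that the value stored for R15 is the instruction address plus 12.

```c
#include <assert.h>
#include <stdint.h>

/* Store the registers named in reglist to consecutive words starting at addr
 * (increment-after order, lowest register first). mem is indexed in words.
 * Returns the address just past the last word written. */
uint32_t arm_stmia(const uint32_t regs[16], uint32_t instr_addr,
                   uint16_t reglist, uint32_t *mem, uint32_t addr)
{
    assert((addr & 3) == 0);   /* this sketch only handles aligned stores */
    for (int r = 0; r < 16; r++) {
        if (reglist & (1u << r)) {
            /* R15 is stored as the STM instruction's address + 12,
             * not the +8 that an ordinary read of the PC yields. */
            mem[addr / 4] = (r == 15) ? instr_addr + 12 : regs[r];
            addr += 4;
        }
    }
    return addr;
}
```

Storing R0 and R15 from an instruction at 0x1000 writes R0 followed by 0x100C.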
August 4th, 2008 by nkeynes
Minor OS X update
Posted in Development

I’ve uploaded a new OS X binary package that actually runs on 10.4 now. It also includes a fix for the Intel GMA950, and a new .dst document icon. In any case, you probably only want this if the previous package didn’t work for you.