lxdream.org :: News
lxdream 0.9.1
released Jun 29
Lxdream is an emulator for the Sega Dreamcast system, running on Linux and OS X. While it is still in heavy development (and many features are buggy or unimplemented), it is capable of running most demos and some games.
January 28th, 2009 by nkeynes
Testing with Intel’s icc compiler
Posted in Development

Out of sheer curiosity, I thought it might be worth seeing how icc performs on lxdream – short answer, not too shabby at all. All tests were otherwise run with the same command options, taking the best of 3 runs:

Compiler               | 5-second core runtime | Improvement
gcc -O2                | 3.10s                 | N/A
gcc -O2 -fprofile-use  | 2.96s                 | 4.6%
icc -fast              | 2.96s                 | 4.6%
icc -fast -prof-use    | 2.73s                 | 12%

The profile-guided builds use a profile generated from the same test. 5% is kind of meh, but 12% on the icc profile build… ok that’s pretty nice. I will probably have to look at generating some decent general purpose profile traces for production builds.

In any event, I’ve added support for building with icc, for the benefit of the 2 people who actually have it ^_^.

Otherwise I’ve finally got some very basic UTLB test cases in now, and fixed a number of bugs that turned up – it seems to be at least as stable as the old version was by now (which actually still had a few bugs too incidentally…)

January 14th, 2009 by nkeynes
Memory system rewrite
Posted in Development

The memory system rewrite is merged now – there are a few things I’m not completely happy with yet, and the old page_map isn’t quite gone completely, but on the whole it’s simpler, faster, and much more consistent. More importantly perhaps, UTLB translation is now _very_ cheap (3-instruction overhead[0] for OSes using the typical 4K page) – Linux now boots and runs at full speed on my systems. There are probably a few lingering issues and I’m still working on a good test suite for it[1], but most bugs are likely to be in things that never worked before anyway.
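As a rough illustration of how a translated address lookup can get down to a handful of instructions, here is a minimal sketch of a flat table indexed by 4K virtual page number, so a hit costs a shift, a load and an add. This is an assumption about the general technique, not lxdream's actual memory system; the names page_table, map_page and translate are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS  12                          /* 4K pages */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define PAGE_MASK  (PAGE_SIZE - 1)
#define PAGE_COUNT (1u << (32 - PAGE_BITS))    /* pages in a 32-bit address space */

/* One host pointer per virtual page; NULL means "not mapped" (hypothetical layout) */
static uint8_t **page_table;

/* Allocate an empty table; returns 1 on success, 0 on allocation failure */
static int page_table_init(void) {
    page_table = calloc(PAGE_COUNT, sizeof *page_table);
    return page_table != NULL;
}

/* Record that the virtual page containing vaddr is backed by host_page */
static void map_page(uint32_t vaddr, uint8_t *host_page) {
    assert(page_table != NULL);
    page_table[vaddr >> PAGE_BITS] = host_page;
}

/* Translate a virtual address to a host pointer, or NULL if unmapped */
static uint8_t *translate(uint32_t vaddr) {
    uint8_t *page = page_table[vaddr >> PAGE_BITS];
    return page ? page + (vaddr & PAGE_MASK) : NULL;
}
```

The trade-off in this sketch is memory for speed: the table has about a million entries, but a TLB update only has to rewrite the entries for the pages it touches.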

I also have some work-in-progress on the operand cache (nominally the original reason I started doing the rewrite…), but it’s still showing a bit more of a performance hit than I would like (10-15%). So currently I’m thinking this will probably wait for the next version before being fully integrated and finished. It does need to be done eventually though for correctness reasons, since the SH4 doesn’t ensure cache-coherency in hardware.

In any case, once the MMU tests are done I’ll get back on the translator upgrade. It’s looking at this stage like 0.9.1 will end up being almost purely a performance release, but since it should be at least twice as fast overall as 0.9, no one is really going to complain about that, right? ^_^

[0] We might be able to special case sdram access and get that case down to 0 instructions, but leaving that aside until after the op-cache is done…
[1] Annoyingly enough, there doesn’t seem to be a good way to recover from TLB multi-hit resets on the DC, which makes it a little hard to test that aspect of things… Even more annoyingly, the DC BIOS _does_ vector manual resets through 0x8c000018, but not any other reset.

December 9th, 2008 by nkeynes
I am Jack’s complete lack of update
Posted in Development

November has, unfortunately, been very busy with other matters, so lxdream hasn’t really gotten much attention lately – the little that I have gotten done has been rather scattershot. Right now, at least, I’m hacking on the memory system (i.e., implementing the caches and bus timing), and will get back to the translator sometime after that’s working.

But just to scare any programmers reading this, below is my latest configure test:

AC_DEFUN([AC_CHECK_STACK_TRICKS], [
AC_MSG_CHECKING([support for dirty stack tricks]);
AC_RUN_IFELSE([AC_LANG_SOURCE([[
void * __attribute__((noinline)) first_arg( void *x, void *y ) { return x; }
int __attribute__((noinline)) foo( int arg, void *exc ) {
    if( arg < 2 ) {
        *(((void **)__builtin_frame_address(0))+1) = exc;
    }
    return 0;
}
int main(int argc, char *argv[])
{
    goto *first_arg(&&start, &&except);
start:
    return foo(argc, &&except) + 1;
except:
    return 0;
}]])], [
    AC_MSG_RESULT([yes])
    $1 ], [
    AC_MSG_RESULT([no])
    $2 ])
])

For when longjmp just isn’t fast enough. Although one does have to jump through quite a few hoops to stop gcc from enthusiastically (mis)optimizing the test case…
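For comparison, the portable baseline that the configure test is trying to beat looks roughly like the sketch below: the dispatcher records a recovery point with setjmp, and a helper bails out with longjmp instead of overwriting its own return address. The function names are hypothetical stand-ins, not lxdream code.

```c
#include <assert.h>
#include <setjmp.h>

/* Recovery point shared between the dispatcher and its helpers */
static jmp_buf exception_env;

/* Stand-in for a helper called from translated code (hypothetical) */
static int guarded_divide(int a, int b) {
    assert(a >= 0);
    if (b == 0) {
        /* "Raise" the exception: unwind straight back to the setjmp site */
        longjmp(exception_env, 1);
    }
    return a / b;
}

/* Returns the quotient, or -1 if the helper raised an exception */
static int run_block(int a, int b) {
    if (setjmp(exception_env) != 0) {
        return -1;   /* arrived here via longjmp */
    }
    return guarded_divide(a, b);
}
```

Calling run_block(6, 3) takes the normal return path, while run_block(1, 0) comes back through the setjmp site. The cost is that setjmp has to save register state on every entry, which is the overhead the stack trick avoids.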

November 7th, 2008 by nkeynes
The Road to 0.9.1
Posted in Development

As the main Roadmap says, the focus for the 0.9.1 cycle is performance and accurate timing (two things that sound closely related but really aren’t). There are three main pieces of work here, not necessarily in any particular order:

1. Register allocator for the translator

Currently (on the test machines) the translated code maxes out at about 400 MIPS (millions of instructions/second), counting only the translated instructions themselves and excluding memory access and everything else. (With memory access and other support it’s about half of that, although obviously these numbers depend on the exact instruction mix.) We need to do better.

Plan at the moment is to implement a very simple/low-level intermediate representation, along with a linear-scan allocator – since we’re only doing basic blocks at the moment, this doesn’t require any complex analysis. Although it would be interesting to experiment (if time permits) to see if there’s a benefit to building whole functions at once.
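To make the idea concrete, here is a minimal sketch of linear-scan allocation over the live intervals of a single basic block. It assumes each value has already been reduced to a [start, end] range of instruction indices; the register count, the interval_t type and the spill policy (evict whichever interval ends last) are illustrative assumptions rather than the planned lxdream design.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_REGS 2      /* hypothetical number of free host registers */
#define SPILLED  (-1)

typedef struct {
    int start, end;     /* first and last instruction index using the value */
    int reg;            /* assigned host register, or SPILLED */
} interval_t;

static int by_start(const void *a, const void *b) {
    return ((const interval_t *)a)->start - ((const interval_t *)b)->start;
}

/* Sorts iv by start point and assigns a register (or SPILLED) to each entry */
static void linear_scan(interval_t *iv, int n) {
    interval_t *active[NUM_REGS];
    int n_active = 0;
    int free_regs[NUM_REGS], n_free = NUM_REGS;

    assert(iv != NULL || n == 0);
    for (int r = 0; r < NUM_REGS; r++) free_regs[r] = r;
    qsort(iv, (size_t)n, sizeof *iv, by_start);

    for (int i = 0; i < n; i++) {
        /* Expire intervals that ended before this one starts */
        int kept = 0;
        for (int j = 0; j < n_active; j++) {
            if (active[j]->end < iv[i].start)
                free_regs[n_free++] = active[j]->reg;
            else
                active[kept++] = active[j];
        }
        n_active = kept;

        if (n_free > 0) {
            iv[i].reg = free_regs[--n_free];
            active[n_active++] = &iv[i];
            continue;
        }
        /* No free register: find the active interval that ends last */
        int victim = 0;
        for (int j = 1; j < n_active; j++)
            if (active[j]->end > active[victim]->end) victim = j;
        if (active[victim]->end > iv[i].end) {
            /* Steal its register; the longer-lived value is spilled */
            iv[i].reg = active[victim]->reg;
            active[victim]->reg = SPILLED;
            active[victim] = &iv[i];
        } else {
            iv[i].reg = SPILLED;
        }
    }
}
```

Because a basic block has no branches, the intervals can be read straight off the instruction order, which is why no further analysis is needed at this stage.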

2. SH4 cycle accuracy

By default lxdream currently runs the SH4 at 100 MIPS, with each instruction taking exactly 10ns. Considering that the real instruction rate could run anywhere from 10 MIPS to 400 MIPS, it’d be nice to be a little more accurate. I’m under no illusions that a perfectly accurate timing model is likely to be fast enough for use, but if we can get the accuracy to within about 1% of the real rate, I’ll be happy. First job here is to build a precise pipeline model for testing purposes along with a decent cache implementation, and then start looking at the best way to get a reasonably fast approximation. This is probably going to take the most time to be both correct and fast.
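A small worked example shows how far the fixed-rate model can drift. The sketch below compares the current 10ns-per-instruction estimate against a cycle count at the SH4's 200 MHz core clock (5ns per cycle); the function names are hypothetical, and the cycle counts would have to come from a pipeline model that does not exist yet.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_NS_PER_INSTRUCTION 10u  /* 100 MIPS: the current default model */
#define SH4_CYCLE_NS              5u  /* 200 MHz core clock */

/* Time charged by the fixed-rate model for a run of instructions */
static uint64_t fixed_model_ns(uint64_t instructions) {
    return instructions * FIXED_NS_PER_INSTRUCTION;
}

/* Time taken according to a cycle count (from a hypothetical pipeline model) */
static uint64_t cycle_model_ns(uint64_t cycles) {
    return cycles * SH4_CYCLE_NS;
}

/* Signed error of the fixed model relative to the cycle count, in percent */
static double fixed_model_error_pct(uint64_t instructions, uint64_t cycles) {
    assert(cycles > 0);
    double real = (double)cycle_model_ns(cycles);
    double estimate = (double)fixed_model_ns(instructions);
    return 100.0 * (estimate - real) / real;
}
```

For 1000 instructions that dual-issue perfectly (500 cycles, the 400 MIPS case) the fixed model charges 10000ns against a real 2500ns, which is 300% too slow. For code averaging two cycles per instruction the two models agree exactly.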

3. Rendering optimization

This is the biggest performance gap at the moment, and it’s a very different optimization problem. The first thing to do here is to reverse the geometry tiling performed and render the whole scene at once instead of tile-at-a-time (this pretty much doubles the entire system speed by itself, depending on the scene of course). After that it’s probably going to be the more typical tedious process of profiling, tweaking, etc. We may be able to split the rendering off into a separate thread as well, although that introduces other complexities…

Besides the above, of course there’s various smallish things that will probably be done along the way (handling translucent triangle intersections, tweaking the z-buffer, introducing some realistic I/O timings and implementing basic VMU functionality) which will fit in between the above somewhere.

Changes

  • Fix MMU code not actually using the translation cache… *oops*
  • Create a sorted copy of the UTLB for binary searches – MMU code is now up to around half the speed of non-MMU code (in other words, it’s 10 times faster than it was in 0.9), making dc-linux actually somewhat usable now ^_^
  • PVR2: Fix incorrect frame size/width calculations
  • OSX: Fix the zero (0) key not being recognized on the keyboard.
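
The sorted-UTLB change can be sketched as follows: copy the 64 entries, sort the copy by virtual start address, and binary-search it on each lookup. This is a simplified illustration that assumes non-overlapping, valid entries and ignores ASIDs and protection bits; tlb_entry_t and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UTLB_ENTRIES 64   /* the SH4 UTLB has 64 entries */

typedef struct {
    uint32_t vstart;  /* first virtual address covered by the page */
    uint32_t size;    /* page size in bytes (1K, 4K, 64K or 1M on the SH4) */
    uint32_t pstart;  /* physical address the page maps to */
} tlb_entry_t;

static int by_vstart(const void *a, const void *b) {
    uint32_t x = ((const tlb_entry_t *)a)->vstart;
    uint32_t y = ((const tlb_entry_t *)b)->vstart;
    return (x > y) - (x < y);
}

/* Rebuild the sorted copy; call this whenever the real UTLB changes */
static void utlb_sort_copy(tlb_entry_t *sorted, const tlb_entry_t *utlb, int n) {
    assert(n >= 0 && n <= UTLB_ENTRIES);
    memcpy(sorted, utlb, (size_t)n * sizeof *sorted);
    qsort(sorted, (size_t)n, sizeof *sorted, by_vstart);
}

/* Binary search: returns 1 and fills *paddr on a hit, 0 on a miss */
static int utlb_lookup(const tlb_entry_t *sorted, int n,
                       uint32_t vaddr, uint32_t *paddr) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        const tlb_entry_t *e = &sorted[mid];
        if (vaddr < e->vstart) {
            hi = mid - 1;
        } else if (vaddr - e->vstart >= e->size) {
            lo = mid + 1;
        } else {
            *paddr = e->pstart + (vaddr - e->vstart);
            return 1;
        }
    }
    return 0;
}
```

A lookup then needs at most about six comparisons for 64 entries, instead of scanning all of them.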

October 31st, 2008 by nkeynes
Core Optimization
Posted in Development

I’ve been working on “core” optimization this week (which for this purpose is everything except rendering and I/O) – aiming to get the test loads down to around 40% CPU usage on my machine (leaving the rest for rendering). By comparison 0.9 is running at 80-100%, and that’s with the SH4 underclocked.

Compiler optimizations:

  1. Turn on -fomit-frame-pointer (for 32-bit builds). I’ve been wanting to do this for a while, but it had the slight problem of completely breaking exception handling. Fortunately there is a solution: build with -fexceptions (or one of the other flags that emit eh_frame sections) and use _Unwind_Backtrace instead of manual frame-pointer chasing.
  2. Enable SSE2 math for i386-linux (already enabled on all the other platforms)
  3. Convert the functions called from the translator to use register-passing calling conventions (regparm) – this is a decent 5-6% improvement (Note that these three all apply only to 32-bit code – the 64-bit ABI already behaves this way by default)
  4. OS X: Disable PIC code generation (I now discover that for some ineffable reason Apple enable it by default, unlike most platforms) – this is about a 12% speedup by itself, which pretty much brings it back to par with the Linux version. If I’d known about this earlier…

Translated code generation:

  1. Remove all the ugly generated fpscr check/branch for the different FPU modes, and just check it at the start of the translation block – if it’s different from last time, flush and retranslate. Small win (about 3-4%) on FP code. (This was suggested a long time ago by dknute but I hadn’t gotten around to doing it until now).
  2. Implement SSE3 versions of FIPR and FTRV – the latter gives us a 4.5% improvement overall on typical rendering tests (eg 1-2% FTRVs) – that’s pretty good for tweaking one instruction.
  3. Optimize the store-queue write path a little bit (used fairly heavily by most apps)

I’ve also added a couple of new configure options: --disable-optimized turns off all optimizations and compiles with -g3, and --enable-profiled does (surprisingly enough) a profiling build.

Results after all of the above (on one particular test load): 32-bit OS X: 36% faster; 32-bit Linux: 21% faster; 64-bit Linux: 12% faster. 32-bit + 64-bit versions are now performing almost identically, with the 64-bit build just a hair in front.

Of course, this doesn’t directly translate into equivalently better frame-rates as we’re more limited by render performance than core speed at the moment, but every little bit still helps.

October 25th, 2008 by nkeynes
Lxdream 0.9 “Shiny” Released
Posted in Releases

Go get it now on the download page. It’s looking very nice.

This is the first version where we can say that most software should “just work”, outside of a small number of known issues – so please report any other problems you encounter. Note however that the focus of 0.9 has been on accuracy – performance has not substantially changed from earlier versions. That will be the main aim of the work for 0.9.1 “Speedy”, along with timing precision and a few other things.

What’s new

  • Improved accuracy + compatibility (aka many bugfixes)
  • Shadow volumes, render-to-texture, fogging
  • Light-gun support

More details in the release notes

October 24th, 2008 by nkeynes
Week of bugs
Posted in Development

I’ve been tidying up a number of little issues that have been hanging around – nothing major but it’s good to get them out of the way. So this should be about it for the release now unless something release-critical turns up.

Changes

  • Fix assorted minor compile warnings
  • Fix save-state compatibility between 32-bit and 64-bit platforms
  • Fix save-state loading in headless mode
  • Fix make distcheck
  • Increase ALSA start buffer size (sounds much less choppy now)
  • SH4: Fix yet another flag-clobbering case. That should be all of them now *fingers crossed*
  • PVR2: Fix texcache reset breaking data structure invariants
  • PVR2: Fix FBO reuse when using more than 2 buffer sizes (crash)
  • GUI: Display an error message when unable to run rather than just disabling the button
  • GTK: Remove annoying error messages when loading save-state previews
  • OSX: Add preferences toolbar item to main window
  • OSX: Fix changes to path properties not taking effect until restart

October 16th, 2008 by nkeynes
Feature-freeze for 0.9
Posted in Development

The triangle sort improvements are in now (that being the most requested fix in the poll), which has made a big difference to many scenes. It’s still not 100% correct (and the 100% correct version needs a lot more time than I have right now), but the situations it gets wrong tend not to crop up so often in real scenes. From this point I’m just tracking down a few rendering glitches and related bugs which may or may not be fixed in time for release.

The bad news is that 0.9 will not have complete or perfect PVR2 emulation as originally envisioned. The good news is that most of the features that are still missing now are used fairly rarely, and the overall rendering quality is a huge jump over 0.8.*. And we’ll only continue to improve the quality and performance in following releases, of course.

What didn’t make the cut:

  • Accumulation buffer + bump mapping (although the error dialogs have been killed)
  • Shadow volumes for translucent polygons
  • Strip buffers

Changes

  • PVR2: Render-to-texture actually works now, with some internal simplifications
  • PVR2: Made triangle sorting algorithm approximate correctness a little more closely.
  • SH4: Fixed a few places where the T flag was accidentally being clobbered
  • OSX: Add the default config file into the bundle

October 5th, 2008 by nkeynes
Shadow volumes are in
Posted in Development

for opaque polygons, and they’re looking quite nice. And it didn’t even have as big a performance hit as I had expected. Translucent shadows need to be dealt with separately, alongside the translucent poly sorting (and will probably be quite a bit more expensive, but then again they’re more expensive on the actual PVR2 as well).

Other than that, Real Life(tm) has been quite busy recently, so progress has been a little slower than one might like, but we’re still on track for 0.9 at the moment.

Changes

  • Add opaque shadow volumes
  • Fix some punchout ordering problems
  • Add EXT_packed_depth_stencil support

September 19th, 2008 by nkeynes
New Poll: Feature requests for 0.9
Posted in Development

Now taking requests here: Poll

I’ve been looking into why Ketm (among other games) was showing some strange timing problems – turns out that they’re very sensitive to the exact SH4 CPU speed, presumably due to use of timing loops. Unfortunately the raw instruction rate of the SH4, as with most modern CPUs, is quite complicated to model, so this is going to take some time to get right. Admittedly we were always going to need cycle-accurate emulation eventually (or at least close-to-cycle-accurate).

Otherwise work on the accumulation buffer + shadow volumes (at least for opaque polygons) is progressing nicely, but nothing to brag about as yet. In a rare bit of good news, it appears that ATI’s latest 8.9 drivers do in fact (at least claim to) support EXT_packed_depth_stencil at long last, meaning that people with ATI cards will actually be able to see the shadow effects now. Well, once they’re working at all, anyway.

No changelog today, as I find the recent lack of commits disturbing…
