Two games with distinct input methods: pressing the pen on a booklet page to play MIDI notes, or scanning a barcode card to play an event sequence.

After I spent a few months reverse engineering the Sega Beena, the driver eventually landed in MAME, so it’s time for a technical retrospective. If nothing else, it shows that nothing exceptional was done here, just “tool-assisted guessing” and enough persistence to make breakthroughs. In the following sections, I’ll go over my humble approaches, highlighting some of the challenges faced. How humble? Well, this was my average debugging session:

Checking how pen position updates some work RAM variables. At this point, I was missing a priority bit to render the pen above other sprites.

Before we dive in, I want to point out some architectural choices that might have been motivated by simplifying the developer experience. In turn, they also simplified the reverser experience!

  • The BIOS has its own share of action: it takes care of enabling/disabling interrupt handling, has a minimal bootloader to load program code either from mask ROM, flash, or some other interface that only existed in development boards, and some utility functions called by games, e.g. decompression;
  • Graphics are based on a custom tile-based engine released as late as 2005. While Sega Toys didn’t have a demanding niche to wow with sophisticated effects, they had lessons learned from previous consoles, which can be fun to compare with;
    • I was somewhat familiar with the Sega Mega Drive’s VDP, and knowing how a similar engine works helps in recognizing features, such as how multiple tile layers can be manipulated, what kind of scrolling and zoom effects they support, and so on. One notable difference is that tile data isn’t streamed to a single I/O data port on Beena; instead, that data is written directly to memory regions. Therefore, taking a memory dump will give you a much better idea of tile layout than tracing references to tile data and their corresponding writes to some memory-mapped address;
  • Audio isn’t handled by dedicated chips that require data to be stored in some custom synthesizer format. Instead, well-known file formats are used, such as MIDI, Ogg Vorbis, Sun Microsystems’ Au…
  • Finally, the luxury of a JTAG port for debugging, enabling live memory manipulation, stepping through code… 🥰 anything that is hard to tell from program code analysis alone can fall back to this, such as mirrored memory regions, maximum pen coordinates…

Picking a framework

Emulating ARM CPUs and rendering 2D graphics are problems that have been solved several times, so it makes sense to leverage existing work. Such a setup could consist of Libretro + Unicorn. I ended up picking MAME, not only because I was already familiar with its internals, but also because it addressed limitations of the previous setup:

  • Debugging tools for CPU execution, memory editor, tilemap preview…
  • Primitives for rendering tilemaps, e.g. parsing bits per pixel and arranging tiles in layouts;
  • Using mouse input in a rendered screen, as well as static bitmaps for booklet pages;

Ghidra Love/Hate

Following my previous research, so far we knew:

  • Some I/O ports for page sensors and pad buttons;
  • Data format of uncompressed bitmaps and tiles;

Since I had already started disassembling some games, it would be nice to carry over labels across them. It turns out that whatever SDK / libraries were used had a lot of common functions statically linked with game program code. A perfect fit for Ghidra’s Function ID (FID) database.

I had my worries that it would require masking byte patterns to get flexible matching (an approach I had seen in PSX Loader). Thankfully, it goes well with ARM, due to data addresses being stored at the end of functions and accessed with relative offsets, so even if they change, operand bytes are identical. On the flip side, you get conflicts with functions where only referenced memory addresses changed (e.g. there are two distinct joypads on the console, so getters for left/right pad buttons and left/right pen inputs have identical implementations).

Now, imagine you are analyzing some function, but don’t find any cross-references to some address you are sure must be set at some point, and eventually you bump into this:

No wonder you get missing cross-references… But why would it disassemble a branch instruction in Thumb mode, when the previous instruction was a push in ARM mode? Even worse, some subroutines would be disassembled in Thumb mode, despite being referenced by a data address where the least significant bit wasn’t set. Unfortunately, both cases happened so frequently that I had to script a workaround to force ARM mode disassembly.

Reset? But I didn’t do anything!

Right off the bat, my driver couldn’t even get out of BIOS code, and ended up in a reset loop. After bisecting the BIOS with breakpoints, the guilty instruction was always the same:

Moving zero to the program counter (R15)... back to the start.

The value on the link register (BIOS runs in Supervisor Mode, so it’s using shadowed register SR14) didn’t seem to be correctly restored after returning from a function call:

LDMFD doesn’t do anything too exciting:

Load Multiple Increment After (Load Multiple Full Descending) loads multiple registers from consecutive memory locations using an address from a base register. The consecutive memory locations start at this address, and the address just above the highest of those locations can optionally be written back to the base register.

I decided to look at 0x20003fc0, where the registers were saved… and then looked at the BIOS RAM memory mapping:

map(0x20000000, 0x200003ff).ram();

Yeah, that wasn’t going up to 0x20003fff… Since the address was unmapped, all stored register state was lost and read back as zeros, and address zero happens to be where the BIOS reset handler lives.

MAME actually warns about unmapped memory accesses, but I hadn’t passed the command line option to enable that logging. Learned that the hard way! 😅
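To make the failure mode concrete, here is a toy model of it in Python. This is only a sketch, not MAME code: the `Bus` class and the return address are made up for illustration, and it assumes unmapped reads come back as zero, as observed above.

```python
class Bus:
    """Toy address space: writes outside mapped RAM are dropped, reads return 0."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.ram = {}
        self.unmapped = []  # what the emulator would have logged

    def write32(self, addr, value):
        if self.start <= addr <= self.end:
            self.ram[addr] = value & 0xFFFFFFFF
        else:
            self.unmapped.append(("w", addr))

    def read32(self, addr):
        if self.start <= addr <= self.end:
            return self.ram.get(addr, 0)
        self.unmapped.append(("r", addr))
        return 0


def call_and_return(bus, save_addr, lr):
    """Save the link register on entry, then load it back into pc on exit."""
    bus.write32(save_addr, lr)
    return bus.read32(save_addr)


SAVED_AT = 0x20003FC0      # where the registers were saved
RETURN_ADDR = 0x00001234   # hypothetical return address

broken = Bus(0x20000000, 0x200003FF)  # the mapping I had
fixed = Bus(0x20000000, 0x20003FFF)   # covering the whole BIOS RAM

pc_broken = call_and_return(broken, SAVED_AT, RETURN_ADDR)
pc_fixed = call_and_return(fixed, SAVED_AT, RETURN_ADDR)
print(hex(pc_broken), hex(pc_fixed), broken.unmapped)
```

With the short mapping, the saved value never lands anywhere, so the "return" jumps to address zero, and the two dropped accesses are exactly what the unmapped-access log would have pointed at.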

Recognizing patterns

It doesn’t matter what needle you pick from the haystack, as long as it’s something you can trace back from. Strings are common choices, but in some cases, constants are also fine. For example, I/O registers for a real-time clock were figured out by tracing an in-memory structure, which had fields being compared against well-known datetime limits:

undefined8 parse_clock_regs(rtc_t *clk) {
    // ...
    if (59 < clk->second) { err = 1; }
    if (59 < clk->minute) { err = 1; }
    if (23 < clk->hour) { err = 1; }
    if (30 < clk->mday) { err = 1; }
    if (6 < clk->weekday) { err = 1; }
    if (11 < clk->month) { err = 1; }
    if (2031 < clk->year) { err = 1; }
    if (err == 1) {
        // sanity checks failed, clear all parsed values
        memset(clk,0,0xe);
    }
    return CONCAT44(in_lr,err);
}

Curious to see a check failing after year 2031, well before the signed 32-bit overflow date

Another example: to figure out pen mappings, there was a menu option that when selected with the pen, caused a screen fade out, while playing an .ogg file. It was traced as follows:

  1. Find offsets to .ogg files, just by searching for their header magic bytes;
  2. Extract these files from ROM, and play them until we find the one we are interested in;
  3. Find .ogg address tables referenced in game code; one of the entries has lower bytes that match the offset of the file we are interested in;
  4. Follow cross-references to the address table, we eventually find an .ogg playback function in program code, where bytes eventually get sent to an I/O port;
  5. Trace instructions while the menu is loaded, then annotate the disassembly with instruction coverage (you might prefer a fancier alternative), one or more of the uncovered branches must require a pen press;
  6. In MAME’s debugger, try to force one of those branches to be taken, take another trace, diff with the previous, check accessed memory structures… one of the conditional statements must be comparing pen coordinates against fixed regions mapped to menu options;
Setting the right bit at the right I/O address, to send a pen down event, to finally get that screen fade out.
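Steps 1 and 3 are easy to script. Here is a minimal sketch, not the script I actually used: the function names and the 24-bit mask are assumptions for illustration. It relies on the Ogg page layout, where every page starts with "OggS" and byte 5 of the header carries a flag (0x02) marking the first page of a stream, so continuation pages can be skipped.

```python
def find_ogg_streams(rom: bytes):
    """Offsets of Ogg pages flagged as beginning-of-stream."""
    offsets = []
    pos = rom.find(b"OggS")
    while pos != -1:
        # Byte 5 of a page header holds the type flags; 0x02 marks the first page.
        if pos + 5 < len(rom) and rom[pos + 5] & 0x02:
            offsets.append(pos)
        pos = rom.find(b"OggS", pos + 1)
    return offsets


def table_entry_matches(entry: int, offset: int, mask: int = 0x00FFFFFF):
    """True if the low bits of an address table entry equal a ROM offset.

    The mask is a guess; pick it according to how the ROM is mapped.
    """
    return (entry & mask) == (offset & mask)


# Tiny synthetic ROM: one stream made of a first page and a continuation page.
rom = bytes(16) + b"OggS\x00\x02" + bytes(26) + b"OggS\x00\x00" + bytes(26)
streams = find_ogg_streams(rom)
print([hex(o) for o in streams])
```

Each offset found this way can then be compared against candidate table entries with `table_entry_matches()` to locate the address table from step 3.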

It’s a much longer ride, but same principles. Indeed, tracing and diffing gives a lot of bang for the buck, as my previous writeups must have shown by now…

Interrupt me maybe

Screens are commonly updated on vertical blank intervals. Some systems might dedicate an interrupt vector for this, but with ARM, it must be done in one of the general-purpose interrupts: either the regular IRQ or fast IRQ (FIQ), intended for low-latency code. There’s also the question of when exactly VBLANK happens:

Some examples of differences in blank areas: Atari 2600 vs. Nintendo GBA

Let’s see how handler functions get assigned to interrupts. We can see that BIOS copies handler addresses to the end of BIOS RAM:

void bios_ram_cpy_cbs(void) {
  int i = 0xc;
  undefined4 *src = (undefined4 *)&LAB_000002ac;
  undefined4 *dst = (undefined4 *)&LAB_20003fd0;
  do {
    i = i + -1;
    *dst = *src;
    src = src + 1;
    dst = dst + 1;
  } while (i != 0);
  return;
}

Each interrupt subroutine dereferences the corresponding handler:

                     IRQ
                     XREF[1]:   FUN_00004844:000048a8(R)
00000018 e5 9f f0 34     ldr    pc,[PTR_LAB_00000054] = 20003ff0

In this case, it was initialized with subroutine 00006750:

000002cc e5 1f f0 04     ldr    pc,[LAB_000002d0] = 00006750
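Both listings above use a PC-relative load, which can be checked by hand. A small sketch of the decoding, assuming ARM mode (where the pc reads as the instruction address plus 8) and with helper names of my own:

```python
def ldr_literal_target(insn_addr, insn):
    """Address read by an ARM-mode `ldr Rd, [pc, #+/-imm12]` instruction."""
    # Immediate-offset LDR with pc as base register, no writeback.
    if (insn & 0x0F7F0000) != 0x051F0000:
        raise ValueError("not a PC-relative LDR with immediate offset")
    imm = insn & 0xFFF
    add = bool(insn & (1 << 23))  # U bit: add or subtract the offset
    return insn_addr + 8 + (imm if add else -imm)


def rom_source_of(ram_addr, src=0x2AC, dst=0x20003FD0, words=0xC):
    """Which BIOS ROM address was copied to a given BIOS RAM slot."""
    if not dst <= ram_addr < dst + 4 * words:
        raise ValueError("address outside the copied block")
    return src + (ram_addr - dst)


irq_literal = ldr_literal_target(0x18, 0xE59FF034)
stub_literal = ldr_literal_target(0x2CC, 0xE51FF004)
irq_stub_rom = rom_source_of(0x20003FF0)
print(hex(irq_literal), hex(stub_literal), hex(irq_stub_rom))
```

The IRQ vector reads its target from 0x54, the RAM slot at 20003ff0 is a copy of the instruction at 0x2cc, and that instruction in turn reads its target from the word right after it (0x2d0 in ROM), which holds 00006750.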

This subroutine checks some BIOS variables, before calling the handler itself stored at 20000c5c, if it’s not null:

                     call_0x20000c5c
                     XREF[1]:     000013cc(*)
00005adc e5 9f 00 78     ldr        r0,[PTR_DAT_00005b5c] = 20000c6c
00005ae0 e9 2d 40 10     stmdb      sp!,{r4 lr}
00005ae4 e5 90 00 04     ldr        r0,[r0,#0x4]=>DAT_20000c70
00005ae8 e5 9f 40 70     ldr        r4,[PTR_DAT_00005b60] = 20000c58
00005aec e3 50 00 00     cmp        r0,#0x0
00005af0 05 94 00 04     ldreq      r0,[r4,#0x4]=>DAT_20000c5c = 80025009h
00005af4 08 bd 40 10     ldmiaeq    sp!,{r4 lr}=>local_8
00005af8 01 2f ff 10     bxeq       r0

This address is initialized by game program code:

8002a31e 4b 0c           ldr        r3,[PTR_irq_cb_20000c5c+1_8002a350] = 80025009
; ...
8002a324 60 73           str        r3,[r6,#0x4]=>DAT_20000c5c = 80025009h
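Note the odd handler address 80025009, reached through bxeq r0. On ARM, bit 0 of a BX target selects the instruction set rather than being part of the address. A tiny sketch of that rule (the helper name is mine):

```python
def bx_target(addr):
    """Split a BX target into (address to execute, instruction set)."""
    if addr & 1:
        return addr & ~1, "thumb"  # bit 0 set: switch to Thumb, clear the bit
    return addr, "arm"


game_handler = bx_target(0x80025009)
bios_handler = bx_target(0x00006750)
print(hex(game_handler[0]), game_handler[1], hex(bios_handler[0]), bios_handler[1])
```

So the game handler is Thumb code at 80025008, which is consistent with the 2-byte instructions in the listing above, while the BIOS subroutine at 00006750 runs in ARM mode.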

We see that quite a lot of video related I/O ports are accessed. For each labelled port, it was a matter of tracing several references, and seeing constants for screen dimensions being compared, or tile data being written:

undefined4 irq_cb_20000c5c(void) {
  iVar5 = read4(DAT_c00d0f20);
  write4(DAT_c00d0f20,iVar5 + 1);
  irq_cb_fade();
  cVar4 = read1(DAT_c00d16f8);
  if (cVar4 != '\0') {
    cVar4 = read1(DAT_c00d0f30);
    if (cVar4 != '\0') {
      write1(DAT_c00d0f30,0);
      dVar1 = read4(w_video_layer_ctrl);
      write4(VIDEO_LAYER_CTRL,dVar1);
      dVar1 = read4(DAT_c00d0ee0);
      write4(VIDEO_BITMAP_FULL_W_H,dVar1);
      uVar3 = read4(DAT_c00d0e20);
      write4(DAT_40000024,uVar3);
      dVar1 = read4(DAT_c00d0ea0);
      write4(VIDEO_BITMAP_MOVE_X_Y,dVar1);
      dVar1 = read4(DAT_c00d0e60);
      write4(VIDEO_BITMAP_CLIP_W_H,dVar1);
      dVar1 = read4(DAT_c00d0e80);
      write4(VIDEO_FADEOUT_STEP,dVar1);
    }
    dVar1 = read4(VIDEO_LAYER_CTRL);
    dVar2 = read4(VIDEO_LAYER_CTRL);
    write4(VIDEO_LAYER_CTRL,dVar2 & 0xfffffffe);
    cVar4 = read1(DAT_c00d0f00);
    if (cVar4 != '\0') {
      bios_strncpy_u64_2((undefined4 *)VIDEO_PALETTE,(undefined4 *)&DAT_c00c7100,0x200);
      write1(DAT_c00d0f00,0);
    }
    cVar4 = read1(DAT_c00d0f28);
    if (cVar4 != '\0') {
      bios_strncpy_u64_2((undefined4 *)VIDEO_TILEMAP_SPRITES,&DAT_c00c6d00,0x400);
      write1(DAT_c00d0f28,0);
    }
    cVar4 = read1(DAT_c00d0f08);
    if (cVar4 != '\0') {
      uVar6 = read4(w_video_layer_ctrl);
      if ((uVar6 & 0x40) == 0) {
        uVar3 = read4(DAT_c00c7c00);
        write4(VIDEO_TILEMAP_MOVE_Y._0_4_,uVar3);
      }
      else {
        bios_strncpy((undefined4 *)VIDEO_TILEMAP_MOVE_Y,&DAT_c00c7c00,0xb4);
      }
      write1(DAT_c00d0f08,0);
    }
    if ((dVar1 & 0x200) == 0) {
      uVar6 = 0x20;
    }
    else {
      uVar6 = 0x2c;
    }
    dVar2 = read4(VIDEO_SPRITE_DELTA_X_Y);
    write4(VIDEO_SPRITE_DELTA_X_Y,dVar2 & 0xffffffc0 | uVar6);
    write4(VIDEO_LAYER_CTRL,dVar1);
    cVar4 = read1(DAT_c00d0f10);
    if (cVar4 != '\0') {
      uVar6 = read4(w_video_layer_ctrl);
      if ((uVar6 & 0x30) < 0x21) {
        if ((uVar6 & 0x30) == 0x20) {
          bios_strncpy((undefined4 *)VIDEO_TILEMAP_MOVE_X,&DAT_c00c7300,0x90);
        }
        else {
          uVar3 = read4(DAT_c00c7300);
          write4(VIDEO_TILEMAP_MOVE_X._0_4_,uVar3);
        }
      }
      else {
        bios_strncpy_u64_2((undefined4 *)VIDEO_TILEMAP_MOVE_X,&DAT_c00c7300,0x900);
      }
      write1(DAT_c00d0f10,0);
    }
    write1(DAT_c00d16f8,0);
  }
  bx_r3_chain_bind_callback();
  return in_lr;
}

It seems that raising an IRQ once every VBLANK was enough to get graphics updated in-sync with what was seen on hardware captures. I did have a machine configuration to increase the frequency of these calls, since screen updates seemed slower than expected, but this was due to some busy-waits in game logic, which I’ll cover later…

Peeling the onion of tile layers

Speaking of graphics, let’s look into how they are actually manipulated. Uncompressed bitmaps are a nice starting point. Even before emulation, I had already rendered a font:

Getting a clearer picture on graphics loading: look for cross-references in code that match starting offsets of some blobs...
Script output with accurate palette.

We can start making some hypotheses about how visible tiles are part of a larger bitplane. After all, some margin must exist for scrolling effects… Again, picking up from previous research, we can look at the minimal example of the “no cart detected” screen. In particular, what these work RAM regions contain:

Tile indexes (least significant bits of 2-byte entries):

C00C8000:  00000000 00000000 00000000 00000000  ................
...
C00C87A0:  00000000 00000000 00000000 00000000  ................
C00C87B0:  00000000 00000000 00000001 00020003  ................
C00C87C0:  00041001 00000000 00000000 00000000  ................
C00C87D0:  00000000 00000000 00000000 00000000  ................
C00C87E0:  00000000 00000000 00000000 00000000  ................
C00C87F0:  00000000 00000000 00000000 00000000  ................
C00C8800:  00000000 00000000 00000000 00000000  ................
C00C8810:  00000000 00000000 00000000 00000000  ................
C00C8820:  00000000 00000000 00000000 00000000  ................
C00C8830:  00000000 00000000 00000005 00060007  ................
C00C8840:  10061005 00000000 00000000 00000000  ................
C00C8850:  00000000 00000000 00000000 00000000  ................
C00C8860:  00000000 00000000 00000000 00000000  ................
C00C8870:  00000000 00000000 00000000 00000000  ................
C00C8880:  00000000 00000000 00000000 00000000  ................
C00C8890:  00000000 00000000 00000000 00000000  ................
C00C88A0:  00000000 00000000 00000000 00000000  ................
C00C88B0:  00000000 00000000 00000008 0009000A  ................
C00C88C0:  10091008 00000000 00000000 00000000  ................
C00C88D0:  00000000 00000000 00000000 00000000  ................
C00C88E0:  00000000 00000000 00000000 00000000  ................
C00C88F0:  00000000 00000000 00000000 00000000  ................
C00C8900:  00000000 00000000 00000000 00000000  ................
C00C8910:  00000000 00000000 00000000 00000000  ................
C00C8920:  00000000 00000000 00000000 00000000  ................
C00C8930:  00000000 00000000 0000000B 000C000D  ................
C00C8940:  100C100B 00000000 00000000 00000000  ................
C00C8950:  00000000 00000000 00000000 00000000  ................
C00C8960:  00000000 00000000 00000000 00000000  ................
C00C8970:  00000000 00000000 00000000 00000000  ................
C00C8980:  00000000 00000000 00000000 00000000  ................
C00C8990:  00000000 00000000 00000000 00000000  ................
C00C89A0:  00000000 00000000 00000000 00000000  ................
C00C89B0:  00000000 00000000 0000000E 000F0010  ................
C00C89C0:  100F000E 00000000 00000000 00000000  ................
C00C89D0:  00000000 00000000 00000000 00000000  ................
...
C00C9FF0:  00000000 00000000 00000000 00000000  ................
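A quick way to read this dump is to turn addresses into grid positions. The sketch below is a guess at the layout, not a confirmed format: it assumes big-endian 2-byte entries (matching the byte order of the listings), a row stride of 0x80 bytes (the distance between the non-zero runs above), the low 12 bits as tile index, and the remaining high bits as attribute flags.

```python
import struct

TILEMAP_BASE = 0xC00C8000
ROW_STRIDE = 0x80   # bytes between consecutive non-zero runs in the dump
ENTRY_SIZE = 2      # 2-byte entries


def entries_from_dump(addr, words):
    """Yield (row, col, tile_index, flags) for non-zero entries of a dump line."""
    raw = struct.pack(">%dI" % len(words), *words)
    for i in range(0, len(raw), ENTRY_SIZE):
        (entry,) = struct.unpack_from(">H", raw, i)
        if entry:
            slot = (addr + i - TILEMAP_BASE) // ENTRY_SIZE
            row, col = divmod(slot, ROW_STRIDE // ENTRY_SIZE)
            # Assumed split: low 12 bits index, high 4 bits flags.
            yield row, col, entry & 0x0FFF, entry >> 12


first_row = list(entries_from_dump(0xC00C87B0, [0, 0, 0x00000001, 0x00020003]))
first_row += list(entries_from_dump(0xC00C87C0, [0x00041001, 0, 0, 0]))
for cell in first_row:
    print(cell)
```

Under these assumptions there are 64 entries per row, and the first populated row holds tiles 1 to 4 followed by tile 1 again with a flag bit set.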

Palette indexes (1-byte entries):

; first tile always blank
C0100000:  00000000 00000000 00000000 00000000  ................
...
C01000F0:  00000000 00000000 00000000 00000000  ................

; second tile
C0100100:  0D0D0D0D 0D0D0D0D 0D0D0D0D 0D0D0D0D  ................
...
C0100170:  0D0D0D0D 0D0D0D0D 0D0D0D0D 0D0D0B01  ................
C0100180:  0D0D0D0D 0D0D0D0D 0D0D0D0D 0D090101  ................
C0100190:  0D0D0D0D 0D0D0D0D 0D0D0D0C 08010101  ................
C01001A0:  0D0D0D0D 0D0D0D0D 0D0D0C01 01010101  ................
C01001B0:  0D0D0D0D 0D0D0D0D 0D0C0101 01010101  ................
C01001C0:  0D0D0D0D 0D0D0D0D 0C010101 01010101  ................
...

; end of last tile (index 0x0f)
C01010F0:  0D0D0D0D 0D0D0D0D 0D0D0D0D 0D0D0D0D  ................

; unset
C0101100:  00000000 00000000 00000000 00000000  ................
...

Palette colors (2-byte entries):

40020000:  63187288 72CC7B10 7F547F98 7FDC7FFF  c.r.r.{..T......
40020010:  6E8A6EAD 6AAF66D2 62F56318 00000000  n.n.j.f.b.c.....
40020020:  00000000 00000000 00000000 00000000  ................
...
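Each of these entries fits in 15 bits, which suggests 5 bits per color channel. A minimal conversion sketch follows; the channel order (red in the low bits) is an assumption, and getting it wrong simply swaps red and blue.

```python
def color555_to_rgb888(entry):
    """Expand a 15-bit palette entry to 8 bits per channel."""
    def expand(c):
        # Replicate the top bits so that 0x1F maps to 0xFF.
        return (c << 3) | (c >> 2)

    # Assumed order: red in bits 0-4, green in bits 5-9, blue in bits 10-14.
    r = entry & 0x1F
    g = (entry >> 5) & 0x1F
    b = (entry >> 10) & 0x1F
    return expand(r), expand(g), expand(b)


white = color555_to_rgb888(0x7FFF)
grey = color555_to_rgb888(0x6318)
print(white, grey)
```

Entries with equal channels, such as 0x6318 and 0x7FFF, look the same under either order, so colored entries like 0x7288 are the ones to compare against a hardware capture.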

Putting these pieces together:

Wrangling MAME's gfx_layout structure, from incorrect BGR palette order to missing tile mirror effects, until it was just right.
Basic tile positioning with overlayed hardware captures. Bitplane size was figured out by counting 16x16 grid squares, then comparing with tile indexes in VRAM. These indexes were initialized further than expected, which meant that the visible area was offset from the bitplane.
Wrong video mode, sprites and scrolling unimplemented. Y-axis scroll values are monitored @ 0x40030000. Given that the first logo's position should be a bit up, we know that a negative scroll offset is applied, given by the signed 10-bit value 0x328 (-216). When the second logo appears, it should move down into view, therefore scroll value increases up to 0x3fa (-6). Sprite positioning is absolute, monitored @ 0x40010000.
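The scroll values quoted above can be checked with a generic sign extension helper (a sketch, with a name of my own):

```python
def sign_extend(value, bits=10):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


first_logo = sign_extend(0x328)
second_logo = sign_extend(0x3FA)
print(first_logo, second_logo)
```

This gives -216 for 0x328 and -6 for 0x3fa, matching the offsets described for the two logos.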

The example above illustrates my appreciation for live memory views: it’s much easier to eyeball related addresses, especially when you can pair memory changes with what’s simultaneously happening on video output. Much harder to get that from a trace log of memory accesses!

Another minimal example, test mode screens:

Background vs. Foreground. Confirms that tile index 0 (which fills most of VRAM) needs to be transparent, so that elements from both layers are visible.

While some graphics are rendered by placing indexes for each tile, this is not always the case. Some patterns follow pre-defined layouts, set with layout id and starting tile index:

Only one tile position @ 0x40010000, but all 16 tiles were loaded in VRAM. Layout id = 0xF.
Figuring out how some sprite tile patterns are rendered. Unmatched patterns logged with index and position. Tile layers rendered with a darker tint to distinguish from sprites. Don't worry, the ugly switch case was eventually simplified.
All 16 possible tile patterns.

But even this wasn’t enough to cover all cases, there were still missing graphics:

Both tile layers and sprites shown, along with partially implemented scrolling...

Luckily, taking a memory dump and looking at some other referenced working RAM region revealed the missing graphics:

"Direct bitmap" where each color entry takes 2 bytes, seen with BPP = 16.

This memory range definitely doesn’t mirror the framebuffer, since data doesn’t get overwritten during video updates. Maybe there’s a blitter working under the covers?

It scrolls like this… no like that… no wait…

At some point I wrote scripts for rendering tile layers and direct bitmaps, which culminated in a single script. The idea was to take memory dumps of the corresponding regions, and validate against some assumptions, such as “what’s the full tilemap size”, “should tiles wrap-around when scrolling? if so, which offsets? both x-axis and y-axis?”…

Here’s an example, where all required ROM offsets are passed to render some stripes, shown on the title screen of Pocket Monsters Best Wishes! Chinou Ikusei Pokémon Daiundoukai:

./render.py \
   --rom S-100042-1000.bin \
   --tiles_offset 0x52a23 \
   --tilemap_offset 0x55c9c \
   --palette_offset 0x4c67c 0x4c71c \
   --bg=0 \
   --raw \
   --out=S-100042-1000.stripes.png

Note the numeric markings that aren’t displayed in-game, likely used to identify each frame of the animation. This was actually found by accident, while testing scroll offsets:

Using keybindings defined via input ports to manually move tiles. An environment variable is used to conditionally toggle this functionality when starting MAME, to avoid recompiling just for these changes.

More interesting are the full tilemaps from several games:

Manually added rectangle overlays for visible area and max scroll area.

We can see how Cars 2 Racing Beena: Mezase! World Champion! stores all frames of an animated road background, using y-axis scrolling to pick a frame to display. The Pokémon game splits the screen in two halves, each one scrolled independently:

Background + Foreground x-axis scroll data with same displacement applied to every line of each half.

These games required scroll wrap-around, and my assumption was that it happened at max scroll area boundaries, which didn’t really match what the games expected:

Just one example of messing around with displacements, works for some cases, breaks for others...

After spending a lot of time adjusting these boundaries, eventually I realized, maybe it was simpler: how about we just scroll at the full tilemap boundaries, only taking into account visible area offsets?

Finally, it works!
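In code, the idea boils down to a single modulo over the full tilemap size. This is a simplified sketch rather than the driver's implementation: the function, its parameters, the sign convention, and the 1024-pixel size in the example are all hypothetical.

```python
def tilemap_coord(screen, scroll, visible_offset, tilemap_size):
    """Map a screen coordinate to a tilemap coordinate, wrapping at the full size."""
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative scroll values wrap to the far end of the tilemap.
    return (screen + visible_offset + scroll) % tilemap_size


# Hypothetical 1024-pixel tilemap with the visible area starting at its origin.
wrapped_back = tilemap_coord(screen=0, scroll=-216, visible_offset=0, tilemap_size=1024)
wrapped_forward = tilemap_coord(screen=1023, scroll=1, visible_offset=0, tilemap_size=1024)
print(wrapped_back, wrapped_forward)
```

The same function applies to either axis; only the tilemap size and visible area offset change.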

Old trick, but good to know

Even when handling all written data seen in irq_cb_20000c5c(), there were still some elements that appeared to not be rendered, such as this gradient background:

Emulated vs. Hardware capture.

The only hint that something was off were 2 suspiciously unset palette entries at offsets 0xd2 and 0xd3:

Looks like a good place to start. I tried setting a memory write breakpoint in one of the unset entries, but it wasn’t hit… then I tried a tile address, and hit a function where all involved graphics are set:

undefined4 FUN_80009840(void) {
  // ...
  alloc_tiles_pre1b(&PTR_w_green_bg_words_800bfef8);
  alloc_tiles_pre1(&PTR_w_tiles_ribbon_box_800bf29c);
  uVar4 = compute_offset_for_tiles(0,7,8);
  match_meta_w_pal_n_tiles(&w_meta_ribbon_box,0x4001,uVar4);
  uVar4 = compute_offset_for_tiles(1,0,0);
  match_meta_w_pal_n_tiles(&w_meta_green_bg,0x87,uVar4);
                    /* 3 dogs sprites */
  alloc_tiles(**(undefined4 **)(iVar1 + 0x50),0x62);
  alloc_tiles(*(undefined4 *)(*(int *)(iVar1 + 0x50) + 8),0x7e);
  alloc_tiles(**(undefined4 **)(iVar1 + 0x54),0x1ed);
  alloc_tiles(*(undefined4 *)(*(int *)(iVar1 + 0x54) + 8),0x201);
  alloc_tiles(**(undefined4 **)(iVar1 + 0x58),0x32a);
  alloc_tiles(*(undefined4 *)(*(int *)(iVar1 + 0x58) + 8),0x33d);
  ext_210_pre1(&w_pal_ribbon_box,1);
  ext_210_pre1(&w_pal_green_bg,0x60);
  prepare_video_cbs(0xd2);
  // ...
}

The argument passed to prepare_video_cbs() grabbed my attention:

undefined4 prepare_video_cbs(int param_1) {
  // ...

  iVar3 = read4(DAT_c00cde08);
  if (iVar3 != 0) {
    set_video_cbs();
  }
  write4(DAT_c00cde08,1);
  i = FUN_800270b4();
  if ((i & 1) == 0) { // (1)
    cb = &DAT_80023b98;
  }
  else {
    cb = &DAT_80023c2c;
  }
  write4(w_video_cb1,(uint)*cb);
  // ...
  i = FUN_800270b4();
  if ((i & 1) == 0) { // (2)
    cb = &DAT_80023bdc;
  }
  else {
    cb = &DAT_80023c70;
  }
  write4(w_video_cb2,(uint)*cb);
  // ...
  puVar5 = &DAT_c00cd674;
  puVar6[3] = &DAT_c00cddf4;
  *puVar6 = &DAT_c00cd674;
  uVar7 = *(undefined4 *)((int)&DAT_c00c7100 + (param_1 << 1 & 0x1fcU)); // (3)
  i = 0;
  do {
    i = i + 1;
    *puVar5 = uVar7;
    puVar5 = puVar5 + 1;
  } while (i < 0x1e0);
  // ...

That argument is used to initialize some work RAM area at (3), but let’s instead focus on (1) and (2). Assigned ROM addresses point to some structures containing callbacks:

80023c2c 00 40           undefined2 0040h ; function length (code + data addresses)
80023c2e 00 30           undefined2 0030h ; code length
; start of function
80023c30 e2 8f 20 28     adr        r2,PTR_VIDEO_BITMAP_80023c60
; ...

Each work RAM callback address is then used to copy callback functions to BIOS RAM, starting at 0x20003ecc (here we only see the variables involved, not the actual copy):

int set_video_cb(int param_1) {
  iVar1 = read4(DAT_c00cf9fc);
  write4(DAT_c00cf9fc,iVar1 - param_1);
  return (int)&DAT_20003ecc - (iVar1 - param_1);
}

We can dump BIOS RAM after running this code, and disassemble one of these callbacks:

void FUN_20003e8c(void) {
  clk = (*(uint *)PTR_VIDEO_PIXEL_CLOCK_20003ec4 >> 0xe) - 0x41c;
  dVar1 = 0x41c;
  if (-1 < clk) {
    dVar1 = 0;
  }
  uVar2 = clk + dVar1;
  clk = uVar2 - 0x50;
  if (uVar2 < 0x50) {
    clk = 0;
  }
  *(undefined4 *)PTR_VIDEO_PALETTE[210]_20003ec8 = *(undefined4 *)(PTR_DAT_20003ebc + clk);
  return;
}

But a breakpoint set on this function wasn’t hit either, although this is where those unset palette entries get assigned… there were 2 missing ingredients:

  1. This callback is only run when the CPU is handling a FIQ request, but when exactly should we raise it?
  2. The palette value seems to depend on an I/O address which turned out to be a pixel clock. But what’s the range of values of that clock?

We can look at this puzzle from another angle: recall the initialized work RAM area at (3), which started at 0xc00cd674, pointed to by 0x20003ebc in the snippet above. Let’s check a dump with as many entries as needed to fill all lines (0x3c0 / 4 = 240):

00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000200: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000210: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000220: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000230: 3a3f 0000 3a3f 0000 3e3f 0000 3e5f 0000  :?..:?..>?..>_..
00000240: 425f 0000 427f 0000 467f 0000 469f 0000  B_..B...F...F...
00000250: 4a9f 0000 4a9f 0000 4ebf 0000 4ebf 0000  J...J...N...N...
00000260: 52df 0000 52df 0000 56ff 0000 56ff 0000  R...R...V...V...
00000270: 5b1f 0000 5f1f 0000 5f1f 0000 633f 0000  [..._..._...c?..
00000280: 633f 0000 675f 0000 675f 0000 6b7f 0000  c?..g_..g_..k...
00000290: 6b7f 0000 6f7f 0000 6f9f 0000 739f 0000  k...o...o...s...
000002a0: 73bf 0000 77bf 0000 77df 0000 7bdf 0000  s...w...w...{...
000002b0: 7fff 0000 7bdf 0000 77df 0000 77bf 0000  ....{...w...w...
000002c0: 73bf 0000 739f 0000 6f9f 0000 6f7f 0000  s...s...o...o...
000002d0: 6b7f 0000 6b7f 0000 675f 0000 675f 0000  k...k...g_..g_..
000002e0: 633f 0000 633f 0000 5f1f 0000 5f1f 0000  c?..c?.._..._...
000002f0: 5b1f 0000 56ff 0000 56ff 0000 52df 0000  [...V...V...R...
00000300: 52df 0000 4ebf 0000 4ebf 0000 4a9f 0000  R...N...N...J...
00000310: 4a9f 0000 469f 0000 467f 0000 427f 0000  J...F...F...B...
00000320: 425f 0000 3e5f 0000 3e3f 0000 3a3f 0000  B_..>_..>?..:?..
00000330: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000340: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000350: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000360: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000370: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000380: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000390: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000003a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000003b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

With a little script, let’s render each 2-byte palette entry as a line, and confirm that these entries match the position where the gradient should start, which is line 140 (or 280 in scaled video mode):

An approximation was implemented in the driver: we call FIQ at every scanline, and the pixel clock value is fitted to fall into one of the entries above (yeah, not even an actual timer):

m_video_regs[0x4/4] = 0x10000 * (20 + m_scanline);
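To see why that constant works, here is a small sketch that replays the arithmetic of the decompiled callback in Python. The helper names are mine, and it assumes the callback indexes the table by byte offset, as the decompiled pointer arithmetic suggests.

```python
def pixel_clock_reg(scanline):
    """Value the driver stores in the pixel clock register for a scanline."""
    return 0x10000 * (20 + scanline)


def callback_table_offset(reg):
    """Byte offset into the gradient table, following FUN_20003e8c step by step."""
    clk = (reg >> 0xE) - 0x41C
    if clk < 0:
        clk += 0x41C
    return clk - 0x50 if clk >= 0x50 else 0


offsets = [callback_table_offset(pixel_clock_reg(line)) for line in range(240)]
print(hex(offsets[0]), hex(offsets[140]), hex(offsets[239]))
```

With these constants, each scanline lands on its own 4-byte entry, and line 140 maps to offset 0x230, right where the gradient starts in the dump.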

Whenever a palette register gets written, copy the color to a cache that has size number_of_scanlines * num_palette_entries, which is then looked up when rendering tiles:

m_cache_palette[m_scanline * 0x100 + offset] = color;
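Here is a toy model of that cache in Python, to show how a per-line lookup reproduces the gradient. It is a sketch under assumptions: the class and method names are made up, and carrying the live palette over at the start of each line is my guess at how untouched entries would get their values.

```python
NUM_SCANLINES = 240   # one gradient table entry per line in the dump above
NUM_ENTRIES = 0x100   # row stride used in the driver snippet


class ScanlinePaletteCache:
    def __init__(self):
        self.live = [0] * NUM_ENTRIES
        self.cache = [0] * (NUM_SCANLINES * NUM_ENTRIES)
        self.scanline = 0

    def begin_scanline(self, scanline):
        # Assumption: each line starts from the palette left by the previous one.
        self.scanline = scanline
        row = scanline * NUM_ENTRIES
        self.cache[row:row + NUM_ENTRIES] = self.live

    def write(self, offset, color):
        """Palette register write, as it would happen from the FIQ callback."""
        self.live[offset] = color
        self.cache[self.scanline * NUM_ENTRIES + offset] = color

    def lookup(self, scanline, offset):
        """Color to use when rendering a pixel on the given line."""
        return self.cache[scanline * NUM_ENTRIES + offset]


pal = ScanlinePaletteCache()
gradient = {140: 0x3A3F, 141: 0x3A3F, 142: 0x3E3F}  # first entries of the dump
for line in range(NUM_SCANLINES):
    pal.begin_scanline(line)
    if line in gradient:
        pal.write(0xD2, gradient[line])

print(hex(pal.lookup(139, 0xD2)), hex(pal.lookup(140, 0xD2)), hex(pal.lookup(142, 0xD2)))
```

Entry 0xd2 stays unset above line 140 and then changes line by line, which is the effect a single global palette can't reproduce.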

So the game was overwriting CRAM while each scanline was being drawn. Nice to see that a trick used a decade before in the Mega Drive days still found its use in a newer system.

Choppy modulation

Moving on to audio emulation. There’s quite a bit of low-latency work done at software level: the BIOS can take an .ogg file, convert to PCM, and stream directly to an output port. With an emulated CPU running at a relatively high clock speed of 81 MHz (interpreted, not JIT’d!), it becomes a challenge for real-time audio playback.

As usual, it’s better to confirm where exactly the bottleneck is. Let’s start with a minimal example: generating a simple waveform and outputting it. I used as reference bisqwit’s tutorial on PCM audio, in particular, adapting the tiny generator to MAME land:

void ap2010pcm_device::sound_stream_update(sound_stream &stream, std::vector<read_stream_view> const &inputs, std::vector<write_stream_view> &outputs)
{
    auto &buffer = outputs[0];
    buffer.fill(0);

    /* loop over samples on single channel */
    for (size_t i = 0; i < buffer.samples(); i++)
    {
        int16_t v = ((wave_counter / 8000) % 2 == 0) ? -(32768/2) : (32768/2);
        wave_counter += 440*2;

        buffer.put_int(i, v, 32768);
    }
}
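
To see why those constants give an audible tone, here is a small standalone check of the counter logic. It assumes the stream runs at 8000 samples per second (the initial rate reported in the stream log further down); the function name is made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Same counter logic as the generator above, but only counting how often
// the output changes sign over `samples` consecutive samples.
int count_sign_flips(std::uint32_t samples) {
    assert(samples > 0);
    std::uint32_t wave_counter = 0;
    bool previous_high = false;  // the counter starts in the "low" half
    int flips = 0;
    for (std::uint32_t i = 0; i < samples; i++) {
        bool high = (wave_counter / 8000) % 2 != 0;
        if (high != previous_high)
            flips++;
        previous_high = high;
        wave_counter += 440 * 2;
    }
    return flips;
}
```

Over one second (8000 samples) this counts 879 sign changes. A square wave flips twice per period, so that is roughly 440 periods per second, i.e. an A4 tone.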

This sounded fine. Going back to game program code, here’s the function that takes a pointer to PCM bytes @ 0x20000cf0, reads 2 bytes at a time, and writes each pair of 16-bit samples as a single 32-bit word to output port PCM_DATA @ 0x5001000c (these are passed without reordering, so 16-bit big endian):

FUN_8002e0e8() {
  // ...
  i = read4(PCM_STATUS);
  uVar8 = 0;
  if ((i & 1) == 0) {
    uVar2 = 0xffffffff;
  }
  else {
    if (DAT_20000cc0 == 0) goto LAB_8002e188;
    if (0 < (int)DAT_20000cd4) {
      puVar7 = DAT_20000cd4;
      do {
        i = 0;
        if (0 < (int)(bram_pcm_chunk_len * bram_pcm_chunk_n)) {
          puVar6 = &DAT_20000cf0;
          do {
            if ((i & 1) == 0) {
              uVar8 = (uint)*(ushort *)((int)puVar6 + 2);
            }
            else {
              write4(PCM_DATA,uVar8 << 0x10 | (uint)*(ushort *)((int)puVar6 + 2));
            }
            i = i + 1;
            puVar6 = puVar6 + 1;
          } while ((int)i < (int)(bram_pcm_chunk_len * bram_pcm_chunk_n));
        }
        puVar7 = puVar7 + -1;
      } while (puVar7 != (undefined *)0x0);
    }
    // ...
}
Trace log of writes to PCM_DATA, compared against a hex dump of some example 16-bit PCM file with similar values.
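
On the receiving side, each 32-bit write has to be split back into two samples. A minimal sketch, assuming the samples are signed and that the half-word read first by the game (the upper one) is also the one to play first; the function name is hypothetical and not taken from the driver:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Splits each 32-bit PCM_DATA write back into the two 16-bit samples the
// game packed into it. The sample read first sits in the upper half-word.
// Treating the samples as signed is an assumption.
std::vector<std::int16_t> samples_from_writes(const std::vector<std::uint32_t> &writes) {
    std::vector<std::int16_t> samples;
    samples.reserve(writes.size() * 2);
    for (std::uint32_t data : writes) {
        samples.push_back(static_cast<std::int16_t>(data >> 16));
        samples.push_back(static_cast<std::int16_t>(data & 0xffff));
    }
    assert(samples.size() == writes.size() * 2);
    return samples;
}
```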

We can extract one of these .ogg files, convert it to raw PCM at a 16 kHz sample rate in Audacity, then listen to it with play -t raw -r 16k -e signed -b 16 -c 1 foo.pcm. After updating our driver’s audio stream to use the data sent by games instead, some choppiness became apparent:

MAME vs. play.

Since our driver outputs samples at a constant rate, some of them ended up missing when playing 16 kHz files. This didn’t happen with 8 kHz files (one of them starts playing around the last third of the waveform above), likely because there is less data to process in the same amount of time.

This is illustrated in the following trace log, where we count how many samples are silence when the sound stream consumes the buffer (if all samples are silence, no count is logged). Each count is the shortfall between the samples the stream needed and the samples the game had delivered, and we can tell there are quite a few gaps:

stream start
stream AP2010 PCM ':pcm' changing rates 8000 -> 16000
stream 0s = 103 (had 217, needed 320)
stream 0s = 254 (had 57, needed 311)
stream 0s = 103 (had 217, needed 320)
stream 0s = 256 (had 58, needed 314)
stream 0s = 103 (had 217, needed 320)
stream 0s = 252 (had 57, needed 309)
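
The bookkeeping behind those numbers can be sketched as a FIFO that pads with silence when it runs dry. This is an illustrative model, not the driver’s actual buffer: the class and method names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical FIFO sitting between the PCM_DATA write handler (producer)
// and the sound stream update (consumer).
class PcmFifo {
public:
    void push(std::int16_t sample) { m_samples.push_back(sample); }

    // Fills `out` with exactly `needed` samples, padding with silence when
    // fewer are buffered. Returns the number of padded (silent) samples.
    std::size_t pull(std::size_t needed, std::vector<std::int16_t> &out) {
        const std::size_t had = std::min(needed, m_samples.size());
        const auto end = m_samples.begin() + static_cast<std::ptrdiff_t>(had);
        out.assign(m_samples.begin(), end);
        m_samples.erase(m_samples.begin(), end);
        out.resize(needed, 0);  // pad the shortfall with silence
        assert(out.size() == needed);
        return needed - had;
    }

private:
    std::deque<std::int16_t> m_samples;
};
```

Feeding it 217 samples and pulling 320 reports 103 padded samples, the same arithmetic as the first log line.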

CPU profiling is the usual approach for investigating performance issues. I decided to compare my driver against the one for Game Boy Advance, another ARM7TDMI-based console, which definitely didn’t suffer from audio playback issues.

I collected samples with perf record -F 99 -p $(pgrep mamed) -g -- sleep 60. First, for the GBA driver:

Samples: 987  of event 'cycles', Event count (approx.): 24469431386
  Children      Self  Command     Shared Object           Symbol
+   22.43%     0.00%  mamed       [unknown]               [.] 0xf834bbc0f834bba8
+   19.74%    19.74%  mamed       mamed                   [.] gba_lcd_device::draw_bg_scanline
+   13.24%    12.96%  mamed       mamed                   [.] arm7_cpu_device::execute_run
+   10.29%    10.21%  mamed       mamed                   [.] arm7_cpu_device::update_insn_prefetch
+    6.73%     6.73%  mamed       mamed                   [.] gba_lcd_device::draw_scanline
+    5.69%     5.59%  mamed       mamed                   [.] sound_stream::update_view

Then, three recordings for Beena, where we can see driver-specific code getting less time than the CPU interpreter:

# Data files:
#  [0] perf.nospeedup1.data (Baseline)
#  [1] perf.nospeedup2.data
#  [2] perf.nospeedup3.data
#
# Baseline/0  Delta Abs/1  Delta Abs/2  Shared Object           Symbol                                                                                    >
# ..........  ...........  ...........  ......................  ..........................................................................................>
#
      25.68%       +0.28%       +0.12%  mamed                   [.] arm7_cpu_device::update_insn_prefetch
      18.31%       -0.59%       -0.78%  mamed                   [.] arm7_cpu_device::execute_run
      12.02%       -0.38%       -0.49%  mamed                   [.] std::_Function_handler<unsigned int (unsigned int), arm7_cpu_device::device_start()::{>
       5.75%       -0.39%       -0.63%  mamed                   [.] (anonymous namespace)::sega_beena_state::draw_layer
       5.72%       +0.19%       +0.26%  mamed                   [.] handler_entry_read_memory<2, 0>::read
       5.46%       +0.65%       +0.12%  mamed                   [.] arm7_cpu_device::HandleALU
       4.65%       +0.07%       +0.22%  mamed                   [.] arm7_cpu_device::HandleMemSingle
       3.69%       +0.38%       +0.35%  mamed                   [.] (anonymous namespace)::sega_beena_state::screen_blend
       3.54%       +0.04%       +0.05%  mamed                   [.] arm7_cpu_device::arm7ops_0123

I also confirmed this bottleneck with a minimal driver, where the CPU runs a single busy wait, implemented as an instruction that branches to itself:

Emulation speed dropped to around 45-55%.
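
For reference, this is how a branch-to-self looks at the encoding level in ARM state. The helpers below are a sketch with made-up names; they rely only on the standard ARM branch format (condition in the top 4 bits, a signed 24-bit word offset relative to the branch address plus 8).

```cpp
#include <cassert>
#include <cstdint>

// Encodes an ARM-state B<cond> from address `from` to address `to`.
// The offset is counted in words, relative to `from + 8` (pipeline prefetch).
std::uint32_t encode_branch(std::uint32_t cond, std::uint32_t from, std::uint32_t to) {
    assert(((to - from) & 3u) == 0);  // both addresses must be word aligned
    return (cond << 28) | 0x0a000000u | (((to - from - 8u) >> 2) & 0x00ffffffu);
}

// Decodes the target of a branch instruction located at `addr`.
std::uint32_t branch_target(std::uint32_t insn, std::uint32_t addr) {
    std::uint32_t field = insn & 0x00ffffffu;
    // Sign-extend the 24-bit field, then scale from words to bytes.
    std::uint32_t offset = ((field ^ 0x00800000u) - 0x00800000u) << 2;
    return addr + 8u + offset;
}
```

An unconditional branch to itself always encodes as 0xeafffffe, wherever it is placed. Decoding the 1a ff ff fc word from the busy-wait signature quoted later in this section gives a target 8 bytes before the branch, so that loop is only three instructions long.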

This showed how much the driver would slow down on busy loops, but we still don’t know where they happen in BIOS or program code. I used a crude approach to identify these: in MAME’s debugger, take an instruction trace with trace foo.log,,noloop, then get the most frequently hit instructions with cut -c-8 foo.log | sort | uniq -c | sort -V.

While some instructions were part of a subroutine for BIOS Ogg Vorbis decoding, those weren’t the most frequent ones… Remember earlier when I talked about graphics updating slower than expected? Turns out that the worst hotspot also affected those updates! It was a busy loop that I documented on the driver itself:

/*
    All games execute a busy wait until the next IRQ request is served.
    This can significantly degrade emulation speed.
    The busy wait subroutine is copied to a dynamic location in work RAM,
    somewhere after 0xc00cc000, but before the stack pointer. r0 stores
    an address to a variable that is updated by the game's IRQ callback
    when video data has been processed:
        e3 a0 30 01   mov     r3,#0x1
        e5 c0 30 00   strb    r3,[r0,#0x0]
        e5 d0 30 00   ldrb    r3,[r0,#0x0]
        e3 53 00 00   cmp     r3,#0x0
        1a ff ff fc   bne     LAB_c00ce8bc
    Epilogue is the following for most games:
        e5 9f 30 00   ldr     r3,[DAT_c00ce8d0] = 80000000h
        e5 93 f0 08   ldr     pc=>LAB_c00fff80,[r3,#offset ->SP]
    But slightly different in early games:
        e1 2f ff 1e   bx      lr
    Since this code has a predictable byte signature, we can search
    in memory to find its exact start address, then consume enough cycles to
    reduce the number of instructions executed until the next IRQ is asserted.
*/
if (m_irq_wait_start_addr == UNKNOWN_ADDR) {
    if (m_maincpu->pc() > 0xc00cc000 && m_maincpu->pc() < 0xc00fff80) {
        const uint32_t IRQ_WAIT_SIGNATURE[] = {
            0xe3a03001,
            0xe5c03000,
            0xe5d03000,
            0xe3530000,
            0x1afffffc
        };
        // Start scanning 8 bytes (2 instructions) before the current PC.
        const uint32_t addr_delta = 8;
        uint32_t *shared32 = reinterpret_cast<uint32_t *>(m_workram.target());
        uint32_t candidate_start_addr = m_maincpu->pc() - addr_delta;
        uint32_t candidate_offset = (candidate_start_addr - 0xc00cc000) / 4;
        for (size_t i = 0; i < addr_delta; i++) {
            bool matched = true;
            for (size_t sig_i = 0; sig_i < 5; sig_i++) {
                if (IRQ_WAIT_SIGNATURE[sig_i] != shared32[candidate_offset + i + sig_i]) {
                    matched = false;
                    break;
                }
            }
            if (matched) {
                // i indexes 32-bit words, so scale it to get a byte address.
                m_irq_wait_start_addr = candidate_start_addr + i * 4;

                for (size_t sig_i = 0; sig_i < 5; sig_i++) {
                    m_maincpu->add_hotspot(m_irq_wait_start_addr + sig_i * 4);
                }
                break;
            }
        }
    }
}

As a workaround, interpreter cycles can be consumed whenever the hotspot loop gets hit:

void ap2010cpu_device::execute_run()
{
    for (size_t i = 0; i < ARM7_MAX_HOTSPOTS; i++) {
        if (m_hotspot[i] == 0) {
            break;
        }
        if (m_hotspot[i] == pc()) {
            int32_t icount = *m_icountptr;
            if (icount > 30) {
                eat_cycles(icount - 30);
                break;
            }
        }
    }

    arm7_cpu_device::execute_run();
}

void eat_cycles(int cycles) noexcept {
    if (executing()) {
        *m_icountptr = (cycles > *m_icountptr)
            ? 0
            : (*m_icountptr - cycles);
    }
}

We reduce the number of instructions that get effectively interpreted until the next VBLANK is raised, allowing program code to break out of the busy loop:

void arm7_cpu_device::execute_run() {
    do {
        update_insn_prefetch(pc);
        if (T_IS_SET(m_r[eCPSR])) {
            if (!insn_fetch_thumb(raddr, insn))
                // ...
        }
        else {
            if (!insn_fetch_arm(raddr, insn))
                // ...
        }
        // ...
        m_icount -= 3;
    } while (m_icount > 0);
}
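
To make the effect concrete, here is a toy model of a single timeslice. It is a sketch under assumptions: the function name and the 3000-cycle slice in the usage note are invented, while the flat 3 cycles per instruction and the 30-cycle remainder mirror the snippets above.

```cpp
#include <cassert>

// Toy model of one timeslice of the interpreter loop above. Returns how
// many instructions get interpreted before the cycle budget runs out.
// When the PC sits on a hotspot, all but `keep` cycles are discarded
// first, which is what eat_cycles(icount - 30) amounts to.
int run_timeslice(int icount, bool hotspot_hit, int keep = 30) {
    assert(icount >= 0);
    if (hotspot_hit && icount > keep)
        icount = keep;
    int executed = 0;
    do {
        executed++;    // fetch, decode and execute one instruction
        icount -= 3;   // flat cost per instruction, as in the loop above
    } while (icount > 0);
    return executed;
}
```

With a budget of 3000 cycles, the plain loop interprets 1000 instructions of busy waiting, while the hotspot path interprets only 10 before control returns to the scheduler.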

If we take three more recordings, now with this hack enabled via machine configuration, the CPU interpreter is no longer at the top:

# Data files:
#  [0] perf.speedup1.data (Baseline)
#  [1] perf.speedup2.data
#  [2] perf.speedup3.data
#
# Baseline/0  Delta Abs/1  Delta Abs/2  Shared Object           Symbol                                                                                    >
# ..........  ...........  ...........  ......................  ..........................................................................................>
#
      15.44%       +1.00%       +0.41%  mamed                   [.] (anonymous namespace)::sega_beena_state::draw_layer
      14.92%       +0.10%       +0.72%  mamed                   [.] arm7_cpu_device::update_insn_prefetch
      12.84%       -1.06%       -0.91%  mamed                   [.] arm7_cpu_device::execute_run
       9.88%       -0.17%       -0.54%  mamed                   [.] (anonymous namespace)::sega_beena_state::screen_blend
       6.48%       -0.39%       -0.37%  mamed                   [.] std::_Function_handler<unsigned int (unsigned int), arm7_cpu_device::device_start()::{>
       5.19%       -0.01%       -0.21%  mamed                   [.] storyware_device::screen_update

An alternative would be for players to underclock the CPU through the existing cheat configurations, but that leads to other accuracy issues (e.g. scrolling backgrounds desyncing or lagging).

While this solved graphic updates, it didn’t actually help with audio decoding subroutines, so some choppiness still exists. A cached interpreter could help here. MAME already does instruction prefetching up to a given number of instructions, which should cover smaller tight loops. A first attempt at increasing the default depth didn’t improve performance. Maybe implementing a JIT would be the best solution? There are some recompilers out there like dynarmic, but their supported guest architectures are much newer than ARM7T/ARMv4T…

Homebrew unlocked

JTAG opens the possibility of writing arbitrary code to RAM, but where exactly can we write without affecting program code? How about the direct bitmap RAM region? Games don’t really have to use it: they can render graphics using just tile layers, which are initialized in distinct regions. If our homebrew only uses tile layers, we have around 0xc6000 bytes of direct bitmap free for code and data, not bad at all! I’ve documented how it gets loaded under dump_from_bitmap.

One of the neater payloads is midi_dump_from_bitmap, where I use tiles from a game’s test mode to render a hex dump of I/O addresses mapped to MIDI registers, updated on each VBLANK:

In GDB, while stopped on a breakpoint, I play a note by writing the MIDI message to the corresponding I/O address @ 0x7000000c. Up to 32 MIDI voices can be tracked @ 0x70010000..0x70010580, 0x2c bytes per voice. After continuing execution, only a single voice is updated.
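
The voice layout is simple enough to express as address arithmetic. A small sketch, using only the base address, entry size and voice count quoted above (the constant and function names are made up):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kVoiceBase = 0x70010000;  // first voice entry
constexpr std::uint32_t kVoiceSize = 0x2c;        // bytes per voice
constexpr std::uint32_t kVoiceCount = 32;

// Address of the first byte of a given voice entry.
std::uint32_t voice_address(std::uint32_t voice) {
    assert(voice < kVoiceCount);
    return kVoiceBase + voice * kVoiceSize;
}

// Inverse mapping: which voice entry a given I/O address belongs to.
std::uint32_t voice_index(std::uint32_t address) {
    assert(address >= kVoiceBase && address < kVoiceBase + kVoiceCount * kVoiceSize);
    return (address - kVoiceBase) / kVoiceSize;
}
```

32 entries of 0x2c bytes add up to 0x580 bytes, which is where the upper bound 0x70010580 comes from.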

I’ve also used a GDB script to play notes for each General MIDI instrument. There’s an unused debug feature in Kazoku Minna no Nouryoku Trainer that also tested a subset of instruments, and this script confirmed that the missing instrument numbers would just play the next available instrument on hardware:

Channel 0 vs. channel 9 instruments, note how some waveforms are repeated.

Next steps

Of course, there are still bugs to fix and peripherals left to emulate. And after all that talk about MIDI, you will unfortunately find it missing in MAME’s November release. While I have an outdated branch with MIDI output implemented, it doesn’t actually emulate the synthesizer; it just sends messages to an external MIDI server. Needless to say, it won’t be accurate, even if one tries to make a soundfont out of notes played on hardware.

I never found any MIDI program code mapped in memory, only the actual PCM samples used by instruments, along with some parameters likely used for effects:

Parameters at the beginning, followed by PCM data.

If you recognize anything from this hex dump snippet, let me know!

00000000: 0998 03ff 20b8 010a 60b8 020a a0b8 030a  .... ...`.......
00000010: e998 03ff 0338 1bff 0338 1fff 1ad8 06e0  .....8...8......
00000020: 1ad8 06ec 1ad8 06e4 1ad8 070c 12d8 0710  ................
00000030: fff9 fc08 1ad8 0000 0998 43ff 0338 27ff  ..........C..8'.
00000040: 0b98 0e6f 0338 2fff 0999 cbff 12d8 0712  ...o.8/.........
00000050: 0b98 0fc0 0338 33ff 0338 3bff fffe 0154  .....83..8;....T
00000060: fffe 014c fffc 00b1 e358 1bff ffe1 23ff  ...L.....X....#.
00000070: e350 83ff fffe 0059 eb58 1bff e358 03ff  .P.....Y.X...X..
00000080: ffe1 23ff ffff ffff eb51 4fff eb98 06dc  ..#......QO.....
00000090: 1f18 0fff fab8 03ff 1e58 8dcc 19f8 03ff  .........X......
000000a0: fab8 33ff fffe 0260 eb58 03ff eb58 0fff  ..3....`.X...X..
000000b0: ffe1 23ff ffff ffff fff3 a3ff fffe 0119  ..#.............
000000c0: e358 2bff ffe1 23ff fff3 23ff fffe 00f7  .X+...#...#.....
000000d0: e358 17ff ffe1 23ff fff2 a3ff 1e58 0710  .X....#......X..
000000e0: 11f8 03ff 1e58 06e4 1198 00e9 1e41 070c  .....X.......A..
000000f0: 19f0 c3ff 1e58 0712 11f8 03ff 12d8 0710  .....X..........
00000100: 0998 03ff 1ad8 070c 0518 03ff ffe1 43ff  ..............C.
00000110: fff0 e3ff eb58 17ff eb98 06dc fe38 13ff  .....X.......8..
00000120: 19f8 03ff fab8 03ff fffe 0260 e358 0bff  ...........`.X..
00000130: e341 37ff ffff ffff fff0 83ff ffe1 03ff  .A7.............
00000140: fff0 43ff fffc 009b 1f18 27ff fff9 ec88  ..C.......'.....
00000150: fff8 0007 fff8 0012 1ad8 0000 fbf8 03ff  ................
00000160: fffc 001a 1f18 07ff 2998 07ff 0538 03ff  ........)....8..
00000170: ffe1 03ff ffff ffff fffe 014c 1f18 07ff  ...........L....
00000180: 2998 23ff 0538 03ff ffe1 03ff ffff ffff  ).#..8..........
00000190: fffe 0154 1f18 07ff 2998 13ff 0538 03ff  ...T....)....8..
000001a0: ffe1 23ff ffff ffff fff1 83ff 0998 03ff  ..#.............
000001b0: 1ad8 06e4 1ad8 06e0 12d8 0710 0b98 0fc0  ................
000001c0: 0338 33ff 0338 3bff eb58 2bff eb58 17ff  .83..8;..X+..X..
000001d0: eb58 1bff fffc 00b1 1f18 07ff 2998 0bff  .X..........)...
000001e0: 0538 03ff ffe1 23ff eb5a 1bff 0998 03ff  .8....#..Z......
000001f0: 1ad8 06e4 1ad8 06e0 1ad8 06ec 12d8 0710  ................