Reading about unused content in video games led me to look into one that I played many years ago. We will explore the file formats it used for storing graphics and text, starting with black-box approaches, then tackling the trickier parts by disassembling the game’s executable.

How it started

Mission Critical was a sci-fi adventure game for MS-DOS. The box cover featured this pre-release screenshot, which was also available in a slideshow demo:

Demo screenshot

One of the more striking differences is in two items: there’s a welding torch and a spray that were redesigned in the final game.

I always wondered: could these and other unused graphics still be in the game?

Identifying the right files

There are multiple files with unfamiliar extensions, such as .PIC, .FNT, .Q. Probably .PIC stands for “picture”, but which one would contain the items? Tools such as file or binwalk do not recognize these formats, and in most cases there are no apparent magic byte sequences to aid us.

We can start by checking which files are opened by the game at a given time. Filesystem operations under MS-DOS are similar to what we would find in Windows or Linux, but instead of using procmon or strace to trace these calls, we use a debugger. A straightforward way to run MS-DOS executables is to emulate them under DOSBox, which comes with its own debugger. By default, it logs open file operations. After starting the game and loading a save file:

    608656: FILES:file open command 0 file C:\DOS\MISSION\MISSION.EXE
   9428268: FILES:Special file open command 10 file AUTORUN.LOC
   9454641: FILES:file open command 0 file LEGEND.INI
   9552144: FILES:file open command 0 file MDI.INI
   9578906: FILES:file open command 0 file SBPRO2.MDI
   9588505: FILES:file open command 0 file SBPRO2.MDI
   9597085: FILES:file open command 0 file SBPRO2.MDI
   9648105: FILES:file open command 0 file SAMPLE.OPL
   9972960: FILES:file open command 0 file SF.XMI
  10134070: FILES:file open command 0 file DIG.INI
  10160307: FILES:file open command 0 file SB16.DIG
  10170579: FILES:file open command 0 file SB16.DIG
  10179039: FILES:file open command 0 file SB16.DIG
  10225938: FILES:file open command 0 file D:\MISSION\MC002.VOC
  15273849: FILES:file open command 0 file object.dat
  15287650: FILES:file open command 0 file MCSTR.DAT
  15318919: FILES:file open command 0 file MC001.FNT
  15554514: FILES:file open command 0 file C:\DOS\MISSION\MC001.PIC
  19935397: FILES:file open command 1 file C:\DOS\MISSION\RESTART.DAT
  38381383: FILES:file open command 0 file C:\DOS\MISSION\MC000.SAV
  38404344: FILES:file open command 0 file C:\DOS\MISSION\MC001.SAV
  38429677: FILES:file open command 0 file C:\DOS\MISSION\MC002.SAV
  38452868: FILES:file open command 0 file C:\DOS\MISSION\MC003.SAV
  38476938: FILES:file open command 0 file C:\DOS\MISSION\MC005.SAV
  99316233: FILES:file open command 0 file C:\DOS\MISSION\MC003.SAV
  99482009: FILES:file open command 0 file D:\MISSION\MC002.VOC
 126593126: FILES:file open command 0 file MC010.FNT
 126746260: FILES:file open command 0 file C:\DOS\MISSION\MC001.PIC
 126769437: FILES:file open command 0 file MC002.FNT
 130058967: FILES:file open command 0 file MC003.RGN
 130265260: FILES:file open command 0 file C:\DOS\MISSION\MC003.PIC
 145612958: FILES:file open command 0 file MC001.FNT

Let’s rule out sound files (due to their magic bytes and extensions): .DIG, .MDI, .OPL; config files (plaintext): .LOC, .INI; save files (they match how many saved games I had): .SAV.

There’s an elevator in-game that takes you to several floors. It helps to see what gets loaded when you arrive at each one:

  • Floor 2: MC002.PIC, MC003.PIC, MC001.PIC;
  • Floor 3: MC002.PIC, MC004.PIC, MC001.PIC;
  • Floor 4: MC002.PIC, MC005.PIC, MC001.PIC.

Some common .PIC files are opened on all of these floors. Now we have some candidates to inspect.

Visualizing byte clusters

Faced with unknown file formats, if they are:

  • archives, strings could give us matches for filename entries, with small offsets between them. Otherwise, we could try identifying increasing values in metadata: offsets of file blocks;
  • bitmaps, we can observe in a hex dump per-pixel sequences of 1 byte (grayscale or palette indexes), 3 bytes (RGB), or 4 bytes (RGBA), which would have the same or close values for regions of an image that are colored the same or with gradients. We should have as many of these sequences as there are pixels in the image (i.e. width * height);
  • compressed / encrypted, byte values have high entropy, as these algorithms shouldn’t generate long runs of bytes with the same values (see the sketch after this list).
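
These heuristics can be automated with a rough entropy scan. Here’s a minimal sketch, where the window size and output format are arbitrary choices, not derived from any of these formats:

import math
import sys

def entropy(chunk):
    # Shannon entropy in bits per byte: close to 0 for padding,
    # close to 8 for compressed / encrypted data
    counts = [0] * 256
    for b in chunk:
        counts[b] += 1
    return -sum(c / len(chunk) * math.log2(c / len(chunk)) for c in counts if c)

with open(sys.argv[1], 'rb') as f:
    data = f.read()

WINDOW = 0x400
for offset in range(0, len(data), WINDOW):
    chunk = data[offset:offset + WINDOW]
    if chunk:
        print(f'{offset:#08x} {entropy(chunk):5.2f}')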

These patterns can be visualized with binvis.io, an online tool that not only colors distinct byte ranges, but also arranges them in a Hilbert space-filling curve, which by preserving locality, makes clusters evident.

MC001.PIC:

Curve for MC001.PIC

MC003.PIC:

Curve for MC003.PIC

The following clusters can be observed in order:

  • Sparse values at the beginning (most likely metadata);
  • Padding (null bytes);
  • Groups of:
    • Low valued blocks;
    • Padding;
    • High entropy blocks.

In the hex dump of MC001.PIC, we can see 3-byte sequences in the low valued block. It’s much smaller than the compressed blocks, so these are not bitmaps. Maybe it’s a palette? We can confirm that by setting a max value for blue on each sequence (diff against the original file shown with dhex):

Changed bytes for blue values in MC001.PIC

Which does result in a blue tinted dialog:

Blue tinted menu
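
For reference, this kind of patch can be scripted. A minimal sketch, assuming the palette is a plain array of 256 RGB triples (the offset is hypothetical and should be taken from the hex dump; keep a backup of the file to diff against):

# hypothetical location of the low valued block in the file being patched
PALETTE_OFFSET = 0x1404  # assumption: right after a 00 00 00 03 marker
PALETTE_SIZE = 0x300     # 256 entries * 3 bytes (R, G, B)

with open('MC001.PIC', 'rb') as f:
    data = bytearray(f.read())

# force the blue component of every RGB triple to the 6-bit VGA maximum
for i in range(PALETTE_OFFSET + 2, PALETTE_OFFSET + PALETTE_SIZE, 3):
    data[i] = 0x3f

with open('MC001.PIC', 'wb') as f:
    f.write(data)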

Parsing .PIC

We can tell that these files are archives: most of them contain multiple palettes followed by compressed chunks, so the metadata should contain offsets to the data chunks.

However, we don’t know if they are absolute offsets or relative to the metadata entries. There’s also endianness to account for. To reduce the guessing involved, we can handle the aforementioned cases with some shell scripting. I took the first palette offsets in MC003.PIC (they all seem to start with 00 00 00 03 00 00 00, but I wasn’t sure if the first bytes were padding, so I picked offsets for 03 00 00 00): 0x1403, 0x1282b, 0x2d573, 0x415e9, and ran:

for i in 1403 01282b 02d573 0415e9; do
    # iterate offset range
    for j in $(seq $((0x$i-3)) $((0x$i+3))); do
        # zero-pad odd sized hex values
        j=$(printf '%X\n' "$j" | sed 's/^\(.\(..\)*\)$/0\1/g')
        # iterate endianness
        for k in "$j" "$(printf '%s' "$j" | tac -rs ..)"; do
            binwalk -R "$(printf '%s' "$k" | sed 's/\(..\)/\\x\1/g')" MC003.PIC
        done
    done
done | sort -V | awk '/0x[0-9A-Fa-f]+/{printf "%8s %s\n", $2, $5}'

Which returned these matches (filtering out those that extended beyond the metadata, i.e. past 0x1400):

  0x0 (\x00\x14)
  0x1 (\x14\x00)
 0x28 (\x28\x28\x01)
 0x50 (\x70\xD5\x02)
 0x8D (\x14\x04)
 0xB4 (\xE6\x15\x04)
0x2BE (\x14\x00)
0x6B1 (\x00\x14)
0x6B2 (\x14\x00)
0x997 (\x01\x14)
0x998 (\x14\x01)

We can tell the first match starts at 0x0, since the second offset at 0x28 doesn’t have the additional off-by-one match at 0x29. Let’s also rule out matches for the same pattern if they already appeared before:

  0x0 (\x00\x14)
 0x28 (\x28\x28\x01)
 0x50 (\x70\xD5\x02)
 0xB4 (\xE6\x15\x04)

All offsets were found! It seems they do start with 00 00 00 03 (due to matching 3 bytes before the given offsets), and are encoded in little-endian.

After highlighting these offsets (magenta) and some common patterns (blue) in a hex dump:

Metadata entries in MC003.PIC

The entries are somewhat scattered. However, we can assume they have a constant length of 0x14, since the first 4 bytes of each 0x14-sized entry still form a strictly increasing sequence of offsets. This means multiple compressed blocks can use the same palette entry.

Decompressing graphics

By taking the differences between the offsets we discovered, we know each compressed block’s length (if a palette is expected, we discount 0x304 bytes, since next_offset = start_offset + 0x304 + compressed_size, where compressed_size is the value at entry[0x4:0x8]).
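
Here’s a minimal parsing sketch under the assumptions so far: entries packed back to back from the start of the file, with a little-endian offset at entry[0x0:0x4] and the compressed size at entry[0x4:0x8] (how an entry flags the presence of a palette is still unknown at this point):

import struct

ENTRY_SIZE = 0x14
PALETTE_SIZE = 0x304  # 00 00 00 03 marker + 0x300 bytes of VGA palette

with open('MC003.PIC', 'rb') as f:
    data = f.read()

# walk the metadata in 0x14 byte strides; stop once the
# "strictly increasing offsets" property no longer holds
entries = []
pos = 0
last_offset = -1
while pos + ENTRY_SIZE <= len(data):
    offset, compressed_size = struct.unpack_from('<II', data, pos)
    if offset <= last_offset:
        break
    entries.append((offset, compressed_size))
    last_offset = offset
    pos += ENTRY_SIZE

# for a chunk carrying a palette, the data satisfies:
# next_offset = offset + PALETTE_SIZE + compressed_size
for offset, compressed_size in entries:
    print(f'chunk at {offset:#x}, compressed size {compressed_size:#x}')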

But what was used to generate these blocks? Running strings on MISSION.EXE gives us a hint:

Crusher! Data Compression Toolkit Version 3.0

Luckily, I came across a compatible interface for this library. After extending it with some additional functions, I wrote a bare-bones CLI to easily compress and decompress in this format.

The decompression API takes both the compressed and decompressed sizes. To identify the latter, we can use the same approach as before: search for metadata offsets that contain either width * height values or separate values for width and height. Let’s take note of some image sizes from an in-game screenshot:

Sizes of some in-game images
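
With these measurements, we can scan the metadata region for the dimensions stored as adjacent 16-bit little-endian words. A quick sketch, where the adjacency is an assumption:

import struct

with open('MC003.PIC', 'rb') as f:
    data = f.read()

WIDTH, HEIGHT = 0x280, 0x120  # 640x288, the background size from the screenshot

for pos in range(0, 0x1400 - 3):
    w, h = struct.unpack_from('<HH', data, pos)
    if (w, h) in ((WIDTH, HEIGHT), (HEIGHT, WIDTH)):
        print(f'candidate dimensions at {pos:#x}')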

Going back to the hex dump above, highlighted in blue there are little-endian values 0x280, 0x120 which happen to match the width and height of the background image. With all variables identified, we can decompress a chunk, and here’s part of the hex dump of the first one in MC003.PIC:

00000000: 4848 4848 4848 4848 4848 4848 4a63 b244  HHHHHHHHHHHHJc.D
00000010: 7444 b263 446a 4c4a 554a 4c4a 4c4a 4c6b  tD.cDjLJUJLJLJLk
00000020: 4a4c 6b4a 6b4a 6b4a 6b4a 6b06 6b6b 066b  JLkJkJkJkJk.kk.k
00000030: 6b06 536b 5306 5353 5353 7663 53f2 534c  k.SkS.SSSSvcS.SL

Repeated bytes suggest the decompression worked. Using GIMP, we can open this file as Raw Image Data, and get a recognizable background image:

Decompressed chunk

Since we have palettes, these decompressed bytes must be indexed palette values. Converting them directly results in an image that is too dark, because the palette contains VGA colors, which are encoded in only 6 bits per component. After scaling them to 8 bits, the values get proportionally larger, resulting in a brighter, accurate image:

Converted chunk
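
As a sketch of this conversion, assuming the palette bytes and the decompressed indexes were dumped to the hypothetical files palette.bin and chunk.bin:

with open('palette.bin', 'rb') as f:
    palette = f.read(0x300)
with open('chunk.bin', 'rb') as f:
    indexes = f.read(0x280 * 0x120)

def to8(v):
    # scale a 6-bit VGA component (0..0x3f) to 8 bits (0..0xff);
    # replicating the top bits maps 0x3f to exactly 0xff
    return (v << 2) | (v >> 4)

rgb = bytearray()
for i in indexes:
    rgb += bytes(to8(c) for c in palette[i * 3:i * 3 + 3])

# PPM is trivial to emit, and opens in GIMP like the raw data did
with open('chunk.ppm', 'wb') as f:
    f.write(b'P6\n640 288\n255\n' + bytes(rgb))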

Unused items

So, the big question: are the early icons present? Sadly not, however…

Unused icon
Unused mask

This item never appears in-game! Which reminds me of this scene:

Science lab cabinet

Fits like a glove! However, we need to figure out how to create a masked image.

Masking

Testing with the first item (we already know what it should look like), one of the blend modes that somewhat fit was difference. While shadows looked fine, some parts should have been added, such as the reflection in the surface, which is lighter in the expected image. So either signed addition or xor could work, and the latter was the best fit. Still, some colors were incorrect (like the bottom red detail):

In order: mask, difference, xor, expected Blend attempts

It seemed odd that xor would work fine with greyscale values but not with colors… unless we should xor palette indexes instead. It made sense from a computational point of view: why waste resources converting the indexes to colors for both pictures and then applying xor to a larger number of bytes, when you could just xor the indexes and convert the result? Indeed that was the case, and after parsing all the needed masks and positions (the latter were contained in the 0x14-sized metadata entries), we get an accurate recreation of the cabinet with the third unused item:

Cabinet with unused item
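
A minimal sketch of this compositing step; the function signature and how the position is obtained are illustrative:

# background and mask are arrays of palette indexes, as decompressed above;
# bg_width, the mask dimensions and the x, y position come from the metadata
def apply_mask(background, bg_width, mask, mask_w, mask_h, x, y):
    out = bytearray(background)
    for row in range(mask_h):
        for col in range(mask_w):
            # xor the palette indexes, not the converted RGB values
            out[(y + row) * bg_width + x + col] ^= mask[row * mask_w + col]
    return bytes(out)

Since x ^ 0 = x, zero mask indexes leave the background untouched, which also explains the glitch that shows up next: stray non-zero indexes in a mask alter whatever they overlap.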

After wasting a lot of time looking at these graphics, I noticed an unintended glitch in the cabinet surface:

Cabinet glitch

This was introduced by the mask of the middle item, which for some reason includes lines for the surface; when applied, these happen to cause a misalignment in the dithering:

Middle item mask

Parsing .FNT

Satisfied with this discovery, I moved on to other file formats. The fonts were straightforward, since they were encoded as bitmaps:

Raw data imported from MC001.FNT

Note that each line is encoded in 2 bytes (2 * 8 bits). We can get the metadata size (from 0x0 to the start of the exclamation point, so hex(2 * 8 * 69 // 8) = 0x8a) and the data size for each character (hex(2 * 8 * 11 // 8) = 0x16). The metadata includes kerning data, encoded as width values for each character (they matched the number of characters and were always in the range [0x0..0x10]).

To better illustrate this structure, here’s a highlighted hex dump, with the following colors applied by field:

  • magenta: metadata header
  • brown: character widths
  • green: 1st character
  • cyan: 2nd character

Sections in MC001.FNT

Putting all this together allows us to render these characters with kerning applied:

Rendered font
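
Here’s an illustrative parser based on the measurements above; the character count and the position of the width table within the metadata are guesses, only the 0x8a and 0x16 sizes were actually measured:

NUM_CHARS = 69       # hypothetical, derived from the 0x8a metadata size
METADATA_SIZE = 0x8a
GLYPH_ROWS = 11

with open('MC001.FNT', 'rb') as f:
    data = f.read()

# assumption: one width byte per character at the end of the metadata
widths = data[METADATA_SIZE - NUM_CHARS:METADATA_SIZE]

def render(n):
    start = METADATA_SIZE + n * GLYPH_ROWS * 2
    for r in range(GLYPH_ROWS):
        # each line is 2 bytes; the bit order is assumed to be MSB first
        row = int.from_bytes(data[start + r * 2:start + r * 2 + 2], 'big')
        # draw only the first `width` columns, i.e. with kerning applied
        print(''.join('#' if row & (0x8000 >> c) else '.' for c in range(widths[n])))

render(0)  # the first character, an exclamation point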

Parsing MCSTR.DAT

I was much more interested in figuring out how text was stored, to see if there was some description for the unused item.

Curve for MCSTR.DAT

Clusters:

  • Sparse values at the beginning (again, metadata);
  • Some mixture of low and high (sometimes 0xff) values;
  • ASCII valued block (plaintext words);
  • High entropy blocks.

Here’s where I got stuck. Some of the metadata entries seemed to contain sizes and offsets to compressed blocks, but I just couldn’t get decompression to work. The parsing process needed to be closely inspected, so it was time to reverse the executable.

Disassembly

Due to the particular executable format at hand (LE, a linear executable variant, bound with the DOS/4GW extender), we need to follow these steps:

  1. Unbind and discard the extender with SB /U MISSION.EXE, retrieving the original executable;
  2. Disassemble with IDA 4.1, import the database with IDA 5.0.

To identify the right subroutines to analyze, let’s go back to the debugger.

We want to break whenever an open file operation is called. In MS-DOS, instead of issuing syscalls, we call interrupts directly, such as OPEN EXISTING FILE. Nevertheless, they follow the same architecture calling conventions: registers are used to pass arguments and store results.

The above reference can be looked up to get the interrupt number and AH register value to set a breakpoint: BPINT 21 3D. Now we continue and refresh the data view with Alt-X (according to the interrupt reference, DS:DX contains the address to the filename) until we break on the interrupt for MCSTR.DAT:

---(Register Overview                   )---
EAX=00003D00  ESI=001DA873  DS=0188   ES=0188   FS=0000   GS=0020   SS=0188 Pr32
EBX=00000000  EDI=FFFFFFFF  CS=0180   EIP=00255B28  C0 Z1 S0 O0 A0 P1 D0 I1 T0
ECX=002969E6  EBP=002B69E4                                          IOPL0  CPL0
EDX=001DA873  ESP=002B6858                                  14981947

---(Data Overview   Scroll: page up/down)---
0188:001DA873 4D 43 53 54 52 2E 44 41 54 00 00 00 00 48 BB 28  MCSTR.DAT....H.(
[...]

---(Code Overview   Scroll: up/down     )---
0180:255B28  CD21                int  21

DOSBox doesn’t implement any call stack view, so we have to continue until the next RET instruction, then step into the next caller’s instruction, repeating this until we arrive at the subroutine that loaded the filename address.

At this point we want to match the addresses in the debugger with the ones in our disassembly. Supposedly you can directly convert from DOSBox offsets to file offsets, but I ended up just taking the hex values of a few instructions until I got a single match in the file:

0180:255B28  CD21                int  21
0180:255B2A  D1D0                rcl  eax,1
0180:255B2C  D1C8                ror  eax,1
0180:255B2E  89442404            mov  [esp+0004],eax
0180:255B32  85C0                test eax,eax
0180:255B34  7C07                jl   00255B3D ($+7)
binwalk -R "$(printf cd21d1d0d1c88944240485c07c07 | sed 's/\(..\)/\\x\1/g')" MISSION.LE.EXE
# 0xB38D8

Then we take that file offset, check the first offset in the disassembly (0x10000), along with the corresponding file offset reported by IDA (0x31db0), to arrive at the target IDA offset [1]:

hex(0xb38d8 - 0x31db0 + 0x10000) = 0x91b28

MISSION.EXE was originally compiled from C/C++, not written in assembly, since we can find the string:

WATCOM C/C++32 Run-Time system.

So we expect to see symbols for the usual libc functions that wrap these filesystem interrupts (we can also look up exactly which ones in the library reference).

By following cross-references (xrefs), we get this subroutine hierarchy:

sopen_ (0x91af6) < __doopen_ (0x9ae79) < _fsopen_ (0x9af41) < fopen_ (0x9af5c) < sub_911e0 (0x911e0)

sub_911e0 has a large number of xrefs, and is already user code, not library code. Seems like a good candidate for a subroutine that would be called to load different files at several points. Our current caller (sub_144b4) loads the offset for string MCSTR.DAT (dword_16873), along with an error message if open failed (return code = 0):

000144C4                 mov     eax, offset dword_16873
[...]
000144C9                 call    sub_911E0
000144CE                 mov     dword_E34B8, eax
000144D3                 test    eax, eax
000144D5                 jnz     short loc_144EE
000144D7                 push    offset dword_16873
000144DC                 push    offset aCanTOpenFileS ; "Can't open file %s"
00016873 dword_16873     dd 5453434Dh, 41442E52h, 54h ; DATA XREF: sub_144B4+10o
[hex(ord(x)) for x in 'MCSTR.DAT']
# ['0x4d', '0x43', '0x53', '0x54', '0x52', '0x2e', '0x44', '0x41', '0x54']

Let’s give this subroutine a name (wrap_open_mcstr) and check the next calls with Graph view:

Seems like they do file reads, due to the error message that is loaded afterwards. Another call takes the result of multiplying ebx and edx:

And does getchar() if that result is 1, otherwise read():

Going back to the caller, we can infer ecx is the file pointer (fp_mcstr_dat, its value comes after standard streams + the number of previously opened files), ebx is the number of times to read edx-sized bytes, and eax contains the address where the read bytes are stored (num_entries) [2]. Afterwards, 6 bytes are read in a loop, as many times as the previously read value, and stored in an array (entries), while accumulating sizes read from those 6 bytes (sum_entry_head):

The next instructions do similar parsing of sections in the file, allocating pointer tables to hold their data. Eventually we reach a point where an offset for the start of the compressed blocks is stored (start_cx_block, note that ftell() returns the current position of the file pointer, which comes after the previous sections were all parsed). This offset is confirmed in the debugger to be 0x19e1, matching the start of the first high entropy block:

Memory allocations are attempted with a value of 0xc00; if that fails, successively smaller amounts are tried (maybe_c00), until one succeeds (actual_mem_c00) or allocation fails due to insufficient free memory. This means our decompressed block size has an upper bound of 0xc00.

To sum it up, wrap_open_mcstr parses the following sections (a parsing sketch follows the list):

  • Section 0, [0x0..0x2]: number of entries (n = 0x24)
  • Section 1, [0x2..0x2 + n * 0x6 = 0xda]: entry descriptions
    • 2 bytes: number of blocks in entry
    • 4 bytes: total size of blocks
  • Section 2, [0xda..0x16a2]: block sizes
    • array of 2 bytes per value
  • Section 3, [0x16a2..0x1a08]: lookup table
    • 2 bytes: number of lookup values
    • array of 2 bytes per value
  • Section 4
    • [0x1a08..0x1b0a]: plaintext word indexes
      • 2 bytes: number of words
      • array of 2 bytes per value (first = 0x0)
    • [0x1b0a..0x1e91]: plaintext words
      • 2 bytes: total size of words (n = 0x387)
      • 0x387 bytes: words
  • Section 5, [0x1e91..]: compressed block data
    • array of variable bytes per block
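
Here’s the sketch, following the layout above (all multi-byte values are assumed little-endian, as in the .PIC files):

import struct

with open('MCSTR.DAT', 'rb') as f:
    data = f.read()

pos = 0

def u16(signed=False):
    global pos
    value = struct.unpack_from('<h' if signed else '<H', data, pos)[0]
    pos += 2
    return value

# section 0: number of entries (0x24)
num_entries = u16()

# section 1: per entry, number of blocks (2 bytes) and total block size (4 bytes)
entries = []
for _ in range(num_entries):
    num_blocks = u16()
    total_size = struct.unpack_from('<I', data, pos)[0]
    pos += 4
    entries.append((num_blocks, total_size))

# section 2: one 16-bit size per block (the count is assumed to be
# the sum of the per-entry block counts)
block_sizes = [u16() for _ in range(sum(n for n, _ in entries))]

# section 3: lookup table of signed 16-bit values
lookup = [u16(signed=True) for _ in range(u16())]

# section 4: word start indexes, then the concatenated plaintext words
word_indexes = [u16() for _ in range(u16())]
words_size = u16()  # unclear whether this count includes the size field itself
words = data[pos:pos + words_size]
pos += words_size

# section 5: compressed block data starts here (expected: 0x1e91)
start_cx_block = pos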

Highlighted hex dump, with the following colors applied per section:

  • magenta: element counts
  • brown: sizes / indexes
  • green: 1st value in section
  • cyan: 2nd value in section

Sections in MCSTR.DAT

The value at 0x1a0a tells us that the first plaintext word starts at byte index 0 of the plaintext word table (not included above). The first compressed block spans [0x1e91..0x1ee6] (size 0x55, read from 0xda), while the second compressed block spans [0x1ee6..0x1ef2] (size 0x0c, read from 0xdc).


Although we don’t know the purpose of the lookup table, the compressed blocks have been identified.

While testing my Crusher CLI, I noticed that the library is pretty tolerant to unexpected values:

  • If trailing data is added to a block, decompression still works fine;
  • If a decompression size smaller than the original file size is provided, this results in the decompressed output being truncated, but still matching the original bytes;
  • If a decompression size larger than the original file size is provided, this results in null bytes being appended to the decompressed output.

Therefore, even if we don’t know the exact decompression sizes, we can still try to decompress these blocks… except it still doesn’t work. To be fair, some block sizes were suspiciously small (e.g. 0xc), and the minimal compressed block I could generate in my tests was larger than that. Could it be… another compression algorithm? Only one way to be sure: reversing the decompression subroutine, wherever it is.

Luckily, there were only 2 xrefs for start_cx_block: the function we saw before and another one, which also loaded offsets to data structures used to hold the previously parsed sections.

Basically, this subroutine takes an entry index and a block index as input, and traverses the previous pointer tables to get the corresponding size and arrive at the right offset (if the target block isn’t the first one at start_cx_block, it moves forward by as many block sizes, read from entries_sums, as needed):
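
In Python terms, the traversal amounts to summing the sizes of all blocks that precede the target; a reconstruction reusing the structures from the parsing sketch above:

def block_location(entry_index, block_index):
    # skip all blocks belonging to earlier entries, then the earlier
    # blocks of this entry
    preceding = sum(n for n, _ in entries[:entry_index]) + block_index
    offset = start_cx_block + sum(block_sizes[:preceding])
    return offset, block_sizes[preceding]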

Later on, the allocated space at actual_mem_c00 is passed as argument eax to a subroutine (wrap_s3_read), along with the address of the first lookup table value (mem_s3e1) as ecx, file pointer at the position of target compressed block offset as ebx, and the compressed size to read as edx:

Whenever a saved game is restored, a label for the current room you are in is displayed. By placing a breakpoint before the call to wrap_s3_read, then loading a saved game, we can see that after the subroutine is called, the allocated space now contains the decompressed label (“Communications Center.”), with the total number of decompressed characters returned in eax:

---(Register Overview                   )---
EAX=00000017  ESI=000700D3  DS=0188   ES=0188   FS=0000   GS=0020   SS=0188 Pr32
EBX=0009FFFF  EDI=000700D3  CS=0180   EIP=001D8CB7  C0 Z0 S0 O0 A1 P1 D0 I1 T0
ECX=00000009  EBP=00FD95B0                                          IOPL0  CPL0
EDX=00000005  ESP=002B68F0                                  350756628
---(Data Overview   Scroll: page up/down)---

0180:00FD95B0 43 6F 6D 6D 75 6E 69 63 61 74 69 6F 6E 73 20 43  Communications C
0180:00FD95C0 65 6E 74 65 72 2E 00 B0 EA B0 B0 EA EA EA E0 EE  enter...........
0180:00FD95D0 B0 EA B0 EE B0 EA B0 B0 EA B0 B0 B0 EA B0 EE B0  ................
0180:00FD95E0 B0 B0 B0 C1 B0 B0 B6 C1 B0 B6 C1 B6 F2 B6 F2 B6  ................
0180:00FD95F0 C0 C8 F2 43 C8 43 C8 43 43 B8 43 F7 43 6E 43 F7  ...C.C.CC.C.CnC.
0180:00FD9600 F7 F7 F7 F7 F7 F7 F7 F7 F1 F7 F7 F1 F7 F7 F7 F7  ................
0180:00FD9610 F7 F7 F7 43 F7 43 43 43 43 43 43 43 B8 43 43 B8  ...C.CCCCCCC.CC.
0180:00FD9620 43 C8 C0 43 43 43 C0 43 C0 C8 C0 C0 B6 43 46 F2  C..CCC.C.....CF.

---(Code Overview   Scroll: up/down     )---
0180:1D8CB2  E879FCFFFF          call 001D8930 ($-387)

Clearly this is the subroutine with the decompression logic! Let’s dig deeper…

si is initialized with the compressed size to read, and an unsigned test is done for si > 0 (i.e. do we still have bytes to read):

If so, then some variables we don’t know the purpose of are also checked (we’ll skip them for now). The first lookup table value to be considered (s2_current_value) is not a lookup value, but the total number of lookup table entries minus 2 (s2_sizes_minus2, not the best name, but “sizes” was my hunch at the time):

Eventually, we read the first compressed byte (stored at s3_current_byte), and initialize dl with 8, a counter for the following loop (notice the green branch that goes up, back to the variable checks):

If the loop ends because the counter reached 0, we read the next compressed byte and reset the counter.

Finally, we get to see the lookup table (s2_values) taking part in some arithmetic and dereferencing operations. Note how previous lookup values are used to get the next value (s2_current_value). These operations can be simplified as:

loop_counter = 8
s2_current_value = s2_sizes_minus2

while loop_counter > 0:
    first_bit = s3_current_byte & 1
    node = s2_current_value | first_bit

    lookup_index = node - s2_sizes
    s2_current_value = s2_values[lookup_index]

    # consume one bit; the byte is kept in its own variable so the
    # lookup arithmetic doesn't clobber the compressed stream
    s3_current_byte >>= 1
    loop_counter -= 1

    if s2_current_value < 0:
        # [...]

Note that the table is accessed from the end, since lookup_index is always negative (considering python array indexing). If you go back to the highlighted hex dump, you can verify that the lookup table values are signed: some values are positive (e.g. 0x2, 0x4), while others are negative (e.g. 0xffdc, 0xffd9). When the value is negative, we arrive here:

The previous lookup value is negated and decremented (s3_v_neg). Then:

  • If the result is < 0x80, it is stored as a decompressed character at edi, incrementing the number of decompressed characters so far (num_dcx_chars);
  • Else, it is decremented by 0x80, and some pointer arithmetic is done to access one of the plaintext words (ptr_plaintext). If the word size is > 0, it takes the branch to loc_14A23, where each character of the word is stored at incrementing addresses of edi, also incrementing the number of decompressed characters so far (num_dcx_chars).

When this second loop ends, we go back to the variable checks, which can now be understood:

If there are no more characters to read (si) or the transformed lookup value (s3_v_neg) is 0, we finished decompressing.


This word-based compression algorithm seems to be based on a Dictionary coder, given that a concordance of the full text was built (i.e. the sorted plaintext words with high frequency). Symbols are represented via Huffman coding (the negative leaf values in the lookup table), where indexes of common words decode to values >= 0x80, while less common words and non-words (i.e. punctuation) are encoded character by character as values < 0x80.
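
Putting the pieces together, here’s a Python reconstruction of the decompression routine, reusing the structures from the parsing sketch; the register-level details are simplified, and deriving each word’s end from the next index is an assumption:

def decompress(compressed):
    root = len(lookup) - 2          # s2_sizes_minus2
    out = bytearray()
    node = root
    for byte in compressed:
        for _ in range(8):          # the dl counter: 8 bits per byte
            bit = byte & 1
            byte >>= 1
            # s2_sizes is taken as one past the table end, so the
            # index is negative: the table is accessed from the end
            node = lookup[(node | bit) - len(lookup)]
            if node < 0:
                value = -node - 1   # negate and decrement (s3_v_neg)
                if value == 0:
                    return bytes(out)      # terminator
                if value < 0x80:
                    out.append(value)      # literal character
                else:
                    i = value - 0x80       # concordance word index
                    end = word_indexes[i + 1] if i + 1 < len(word_indexes) else len(words)
                    out += words[word_indexes[i]:end]
                node = root                # restart at the tree root
    return bytes(out)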

We can verify the concordance by creating our own from the decompressed full text, with some rough cleanup on terminators and whitespaces, then counting the top word frequencies, and taking the corresponding words:

cat * \
    | sed 's/[ \t\n]\+/\n/g; s/[^ \t\na-zA-Z0-9_-'"'"']//g; s/[^[[:alpha:]]*$//g' \
    | sed 's/^\s*..\?\s*$//g' \
    | sort \
    | uniq -c \
    | sort -n \
    | tail -n125 \
    | awk '{print $2}'

After sorting and comparing with the original using diff, most entries are matched.


Sadly, I couldn’t find a matching description for the unused item (perhaps it was in one of the empty blocks; a few entries contain some). However, there are some curious debug messages:

MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS*** It's the Autodoc System.
MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS: It's a control servo module.
MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS: th_oxygen_feeds
MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS: ct_data_collection_system.
MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS: th_frequency_Module.
MAV *** REPORT THIS TO MIKE ASAP - THERE SHOULDN'T BE ANY MORE MAVS: th_yellow_sticky.

These are all contained in blocks of the same entry (0xa); maybe they were used to hunt down a bug in a particular scene.

Save game patching

We can actually load these debug messages by modifying values related to items in save game files. First, let’s save the game after these atomic inventory actions:

  1. Open cabinet in science lab, two items available to take (inventory unchanged);
  2. Take the first item (added to inventory);
  3. Take the second item (added to inventory).

Then, we can compare the first two, disregarding the first byte since it’s just the save game filename:

At 0x2b two bytes are updated, and the first one is again updated after taking the second item:

If we try increasing that byte:

Our inventory holds what appears to be the room’s objects, including the only pickable item, and an item that references one of those debug messages:

All unpickable items use an icon we never get to see during normal gameplay! This functionality makes an interesting debug mode, maybe developers used it to check if all expected objects were loaded, without having to hover around the room with the mouse.
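
For reference, the patch itself is a one-liner; a sketch, assuming you experiment on a copy of the save file so the original can be restored:

# bump the byte at 0x2b that the save diff showed being incremented
# when taking an item (values above 0xff would raise, so keep it small)
with open('MC000.SAV', 'rb') as f:
    save = bytearray(f.read())

save[0x2b] += 1

with open('MC000.SAV', 'wb') as f:
    f.write(save)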

TODO

  • Data that links backgrounds with masked images needs to be parsed.
  • There are also some file formats that weren’t explored, such as .Q, which contain audio and animations.

Feel free to join the fun.

Credits

  1. We could also take the difference between IDA and DOSBox offsets:

    hex(0x255b28 - 0x91b28) = 0x1c4000
    

    And use that to rebase our disassembly (Edit > Segments > Rebase program...), so that offsets in both apps are calculated from the same base address (0x1c4000).

    However, IDA 4.1 doesn’t have rebase… is it move?

    What? Let’s check ida.hlp

    Moving a segment means moving its beginning. So, the proper name for this command would be ‘Expand/Shrink a Segment’ (due to historical reasons the name is ‘move segment’).

    This doesn’t seem equivalent at all. Well, if we try rebasing in IDA 5.0:

    lx.ldw: can't load file (error code 126)
    

    Any luck just copying over lx.ldw from IDA 4.1?

    Access violation at address 7220656C. Read of address 7220656C.
    

    So much for that… 

  2. As an alternative to reversing callee subroutines, figuring out where results are stored is a matter of following register changes in the debugger, and setting the Data overview to the same address a register was set to before a call. This way, we can see how many bytes were read and stored at that address, comparing them against the hex dump.