How to emulate a coin-op game on a "small memory" device

Intro

In a previous life I was paid to emulate coin-op games, lots of games. In total, I worked on more than 300 titles. The games were for closed source products and were unrelated to my open source projects and had nothing to do with more recognizable game emulators like MAME. They were sold in commercial game systems such as Chicago Gaming's "Ultimate Arcade 2", some mobile devices such as Microsoft's Pocket PC launch (PocketPak's) and Turner's GameTap system. I made my living by writing these game emulators for a period of about 10 years. Much of the code I wrote back then had to be very efficient because most of our platforms were embedded Arm devices. This same code can find new life on today's Arduino/embedded microcontrollers. As a starting project, I decided to port my Space Invaders emulator to run on moderately spec'd 32-bit MCUs. Taito's Space Invaders was one of the first raster graphics video games to exist and was composed of relatively simple hardware (by today's standards). That same hardware was copied by several competitors and used to publish other game titles in the late 70's and early 80's. The machine is composed of a 240x224 1-bit per pixel raster CRT display (oriented 90 degrees rotated), an Intel 8080 CPU (a compatible subset of the Zilog Z80) and a TI SN76477 + some analog circuitry for sound.

What is a game emulator?

The simplest way to explain game emulation is that the emulator creates a virtual environment that's sufficiently accurate to run the original game code unmodified. In other words, the emulator's job is to allow the original game code to behave as if it was running on its original hardware (generate video, audio and react to user input). It does this using nothing but software running on a completely different set of hardware. The actual means by which is does this is not very complex, but like most software, is made up of many small parts that all have to function correctly together. Let's look at the actual code of the Space Invaders emulator and see what's happening under the hood. The loop below is the main execution loop of the emulator. It executes each frame (60th of a second of activity) of the game by emulating the correct number of cycles of CPU execution followed by video and audio updates. There are other ways to do it, but this is probably the simplest to visualize. In the case of Space Invaders, the machine divides the display into top and bottom halves to prevent tearing/flicker of the moving objects. You can see that there are 2 time slices for each frame which generate different interrupts for the CPU.

while (!bUserAbort) {
   for (j=0; j<2; j++) {
      iClocks = 2000000/120; // 2Mhz @ 60Hz executed in 2 phases
      if (j & 1) {
         ucIRQs |= INT_IRQ;
         z80.ucIRQVal = 0xcf; /* RST 08 = 8080's IRQ */
      } else {
         ucIRQs |= INT_IRQ;
         z80.ucIRQVal = 0xd7; /* RST 10 = 8080's NMI */
      }
      EXECZ80(&z80, emuh, &iClocks, &ucIRQs);
      EMUPlaySamples();
   } // for j
   EMUScreenUpdate(lDirtyRect, NULL, 224, 240);
   iFrame++; /* Count the number of frames */
} // while running

What changes were needed for embedded use?

If you're unfamiliar with MAME or other game emulator code, it generally uses a lot more memory than the original game had and requires a relatively fast machine to run the emulated games at their native speed. When working with Arduino's or other embedded microcontrollers, the capabilities of the target processor are usually the opposite - a very small memory size and relatively slow CPU. Some of the challenges with RAM usage in certain games just can't be overcome, but with a relatively modest challenge like Space Invaders, we can succeed in making it fit. These are the subsystems competing for RAM to emulate Space Invaders:

- Temporary storage to unzip the files containing the game
- Z80 memory map (64K of address space)
- 224x240 bitmap for the playfield
- Any additional structures/variables needed to emulate the game

Luckily I modified zlib specifically to use less memory. Unzip needs to support FLATE data with up to a 15-bit (32K) window size. The Huffman tables, file buffers, etc plus the memory window normally require a lot more than the 41K my library needs.

Next, the Z80 CPU has an address space of 64K bytes. Game emulators normally allocate this 64K in RAM and then copy the game ROMs into this memory. The game's actual RAM is just derived from part of this 64k and the rest is usually wasted. I chose to modify my Z80 code to use a sparse memory map. This saves RAM, but uses more CPU cycles. The reason it uses more cycles is because I divide the memory map into 16 blocks of 4K each and translate each address to reference the correct block with a lookup table. This allows me to use only 16K of RAM for Space Invaders (8K for the game ROMs and 8K for the video RAM). The rest of the memory map is empty and doesn't use any RAM.

The video and scratch RAM of Space Invaders is a 7K window in memory (0x2400-0x3FFF). The memory is a 1-bit per pixel bitmap and the pixels are drawn on the CRT as black and white dots. Any coloration of Space Invaders pixels is done with a transparent film which covers the B&W CRT glass. To emulate these pixels on a PC you would normally allocate a bitmap object of the same size, draw into the bitmap based on the Z80's memory contents and then pass it to the PC's display controller. For this version of the code, I've written it to talk to a SPI or parallel LCD controller which contains its own video RAM. This allows me to avoid allocating extra memory for a temporary bitmap and instead send the pixels directly to the display.

If not for the unzip stage, this game emulator could run on a system with only about 17K of RAM. There's no requirement to read the game data from a ZIP file, but that's the form that's more convenient to work with.

How Does it work?

What's not visible in the main loop above is the initialization and some of the additional activity that needs to occur during the CPU emulation. The hardware generating the video, audio and monitoring the user controls is connected to the CPU through I/O ports and memory mapped addresses. In order to emulate this in a generic way which allows the CPU emulator to be used for other purposes, a dynamic mechanism needs to exist to attach specific emulated hardware to specific port/memory addresses. The way I handle this (on small memory devices) is to have a set of read and write flag bits. Each bit represents 4k bytes of address range. For Z80 ports (IN/OUT instructions), I use a single function for the entire address range. To map the flag bits to the 64k Z80 address map, you would just divide the address by 4096 (shift right 12). To know if a specific address needs a special "handler", just test if the corresponding bit is set. In addition to the I/O emulation, this code needs to fit in a small memory device. I accomplish this by creating a sparse map of the Z80 64K address space by also dividing it into 4K pages and having an offset list pointing to each page. That way, the unused parts of memory don't have to be mapped to anything. Here's how the Z80WriteByte() function implements the sparse memory map and I/O logic:

void Z80WriteByte(unsigned short usAddr, unsigned char ucByte)
{
unsigned char c;

   c = (unsigned char)(usAddr >> EMU_ADDR_SHIFT);
   if (z80_writeflags & (1 << c)) { /* If special flag (ROM or hardware) */
      *iRealClocks = iMyClocks; /* In case external hardware needs accurate timing */
      (mem_handlers_z80[c].pfn_write)(usAddr, ucByte);
   } else { /* Normal RAM, just write it and leave */
      *(unsigned char *)((unsigned long)usAddr + z80_ulCPUOffs[c]) = ucByte;
   }

} /* Z80WriteByte() */


The handler functions invoked by calling the various pfn_write pointers will check the address passed to them and then take appropriate action to emulate the hardware that corresponds to it. As indicated in the comment above, the write handler can also serve to prevent writes to addresses containing ROM. Some games purposely write into the ROM address range to thwart software emulation. Since an emulator will usually load the game code into RAM, this is a clever gotcha to prevent it from running.

Optimizing for Embedded

In the previous section I described how to reduce the RAM needed by the emulator. Equally important to getting it to run at full speed is to optimize the code for execution time. One of the biggest challenges for projects needing video out is that MCUs normally don't have video out. Inexpensive SPI displays are one way to generate graphics output that approaches what one could call video. Let's suppose that we've chosen to use a ILI9341 240x320 SPI LCD (a very common display). If we connect a typical 32-bit MCU to it, we might expect to get a reliable SPI speed of 40Mb/s (40Mhz). That display has 16 bits per pixel and 240x320 pixels, so to refresh the entire display requires transmitting 1228800 bits. In ideal conditions we might be able to transmit that to the display in about 30ms (1 /(40,000,000 bps / 1228800 b)). That means we could potentially update that display at 35 frames per second. This assumes that we have enough memory and a perfectly arranged ping-pong buffer for DMA to transmit the bits and allow the CPU to continue processing during that time. The reality is that we can't get optimal output like that. The other disappointing news is that the game needs to update the display at 60fps to look 'good' and some of the time to transmit the bits will be spent by the CPU to prepare the data and wait for it to finish transmitting. The good news is that a simple game like Space Invaders has a relatively slow CPU and is only able to update small areas of the display during each frame time. In my emulator code, I have a variable called lDirtyRect which is an integer bit field with each bit corresponding to 16 rows on the display. As video RAM is written to, if the values change, the bit corresponding to that section is marked as 'dirty'. When the frame has completed, the display update code only writes pixels to the display in areas that have been changed. This is a very effective way to drastically reduce the amount of data sent to the display. This, in turn, greatly increases the chance that the game can update the display at 60fps. More advanced video games which update the entire display each frame (e.g. a scrolling background) will not benefit from this idea. More complicated games require a more complicated system of 'dirty rectangles' to reduce the amount of drawing per frame. Here's how it's done in the video RAM write handler:

void InvadersVideoWrite(int usAddr, unsigned char ucByte)
{
int x, y, i;

   if (ucRAM[usAddr - 0x2000] != ucByte){ // did it change?

      ucRAM[usAddr - 0x2000] = ucByte;

      usAddr -= 0x2400; // Subtract starting address of display RAM

      x = usAddr >> 5;
      y = 240 - ((usAddr & 0x1f) * 8);
      if (y < 0 || x > 223)
         return; // Not visible
      i = (y >> 4); // Horizontal bands of 16 rows each
      lDirtyRect |= (1 << i); // Mark this section as 'dirty'

   }
} /* InvadersVideoWrite() */

Hacks / Modifications of the code

Wasted CPU cycles in the original game are also a good area to investigate for optimizing your emulator. Most coin-op game authors were not concerned with power savings because the games were powered from wall current. Coin-op game code was written to fit within the capabilities of the CPU. If a particular game has a very complex section of gameplay that stresses the CPU, that 'worst case' code must run not cause any additional latency or the user will notice. This also means that the average game activity is not stressing the CPU. Because of this situation, there will be lulls in CPU activity for the much of the game. Game authors usually used the vertical blank interval (VBlank) to set the pace of the game. If the work that the game had to accomplish was finished before the VBlank period was over, they would often make the CPU sit in a time-wasting loop to wait for the VBlank interrupt to start the next frame. These time wasting loops can represent a large percentage of the total CPU execution time. I've experimented with ways of exploiting this fact in my various emulators and have come up with two solutions - automatic and manual.


Automatic Shortcuts

The code below shows one of the optional 'hacks' that I use to shortcut these time wasting loops. A condition branch to a few bytes behind the current Program Counter usually means that no actual work is occurring and it's just counting down to waste cycles.


case 0xC2: // JP NZ,(nn)
   if (!(regs->ucRegF & F_ZERO)) {
      usAddr = p[1] + (p[2] << 8);
      iMyClocks -= 5;
#ifdef SPEED_HACKS
      if (PC - usAddr <= 3) // time wasting loop
         iMyClocks = 0;
#endif // SPEED_HACKS
      PC = usAddr;
   } else
      PC += 2;
break;

Manual Shortcuts

Another way to prevent time wasting loops from executing is to patch the game code to disable them. The risk in doing this is that game authors started to put self-checks into the code to make sure it hadn't been modified/pirated. The check was usually done as a simple checksum of the game code at power up. It's usually safe to patch the code after the self-test has completed. Some of the more clever game authors would be sneaky about the mode of failure when a checksum error occurred. Instead of making the game display an error or refuse to run, they would have the game purposely misbehave in unexpected ways. The emulator author would then potentially chase these bugs instead of realizing that the checksum had failed. To patch Z80 games, I usually replace the start of the time wasting loop with a HALT instruction. This instruction causes the CPU to wait for an interrupt before continuing. My implementation of the HALT instruction simply sets the remaining cycles to 0 which causes the current frame's time to end immediately. Each game may have multiple time wasting loops that each need to be discovered separately. A simple way to find them is to see what the value of the Program Counter is right before the next VBlank interrupt occurs. For Space Invaders, this value is 0x0A93. Here's the code at that address:


0x0A9E: LD A,(0x20C0)   // A = 0x07

0x0AA1: DEC A

0x0AA2: JP NZ, (0x0A9E)


This code can execute thousands of times per frame depending on the game activity. On a different CPU, this code could cause problems because after the interrupt service routine executed, it would return to this loop. This would force all of the game logic to execute within the interrupt service routine. On the 8080/Z80, the interrupts can be configured to just jump to a specific address without stacking the current program counter. In the case of Space Invaders, it just jumps to the start of the logic for running the top or bottom half of the current frame.


Bringing it all together

In the video below, I have the emulator running on an Adafruit PyPortal - (120Mhz Cortex-M4 with 192K of RAM and an ILI9341 240x320 LCD display). For Space Invaders, this is more than enough CPU power and RAM to run my emulator at 60fps.  I haven't hooked up sound or the user controls yet, but I'll cover those in my next blog post...




I've shared the current code with a Github Gist here:

Space Invaders for Arduino



If you appreciate my blog posts and open source work, please consider sponsoring me on Github:

My Github Sponsors Page

Comments

  1. "I made my living by writing these game emulators for a period of about 10 years." I'm envious of your life.

    ReplyDelete

Post a Comment

Popular posts from this blog

Surprise! ESP32-S3 has (a few) SIMD instructions

How much current do OLED displays use?

Fast SSD1306 OLED drawing with I2C bit banging