A few years ago, I think p4plus2 let me know about the existence of a certain piece of 6502 code. Back then, I didn't understand so much about it and just shrugged it off as "complicated". Recently, I decided to revisit this and noticed that this is actually something very cool and could totally be implemented into SMW.
Let's get right into the good stuff.
Inflate code (2018-07-11 10:54:40 AM @ 4753212 clocks, originally 7695346)
Compressed test files
inflate, before all optimization magic.
p4plus2 is the main contributor to the code while I assisted him with stuff like producing debug files and a bit of optimizing. He wrote the entire thing himself (it's not based off of the repo link from earlier). The code is useable (and impressive), but in terms of performance, it's still unfinished and you'll find out why as you read this post. At the moment it assembles into a homebrew ROM we use for testing. asar in the tools section should work just fine for this.
The code currently runs assuming there are no interrupts. When we/you port this to Super Mario World, keep this in mind. Although interrupts should be disabled during level load already.
DEFLATE is a 'popular' compression method. (Decompressing it is called "inflate"). It is popular in the sense of it being used even in modern-day compression, such as in gzip or PNGs. The usage is so widespread that certain languages have support for DEFLATE compression readily available. Interestingly enough, DEFLATE wasn't invented until a few years after the SNES was released. The format of the compression is discussed later.
zopfli made by Google excels at compressing in DEFLATE, maximizing compression but being slow as a result (at least, when you tell it to be slow). I compared the current SMW compression methods with DEFLATE. Here are some legitimately exciting numbers.
As you can see, using DEFLATE saves over half a bank of space compared to LC_LZ3. That is quite incredible. You can notice that the compression really goes at it, if you open the compressed files in a tile editor; in DEFLATE you pretty much don't see any trace of the original graphics at all. Here's GFX00.bin:

However, compression isn't only limited to graphics. It could also be used to compress other assets which use a lot of space, particularly, custom music.
Compressing all the original SMW music into DEFLATE format also saves over half a bank of space. A noticeable trend here is that as you compress bigger files, the compression gets more and more efficient. See: the difference between music01 and music28. However, to decompress songs you'll need a particularly large RAM buffer. GFX usually requires 0x1000 bytes per file, but music can exceed that. Then again, free RAM shouldn't be a problem during level load, but dynamically uploading songs to the ARAM during gameplay (is that a thing in present-day SMW hacking?) could cause problems. Either way, this is something that is very well within the realm of possibility.
For the sake of even more statistics, I also compressed the (optimized) SMW samples supplied with AddmusicK. This one is interesting, as samples technically speaking are already compressed (BRR). Here are the results:
The space saved from compressing samples is negligible. The compressed files barely have a difference in size compared to the raw files. Furthermore, samples don't seem to be large to begin with, but custom samples could be a different story.
Finally, p4plus2 also benchmarked the performance of the current ASM decompression algorithms using GFX00.bin as a base. Here are the results:
We tried aiming for at least "lz2 (original)" as it's SMW's original decompression routine. Loading times of levels shouldn't be too noticeable with that performance. However, we're still a long shot away from that milestone. We would like some help with it, if anyone is up to the challenge. More details at the end of the post.
Although I don't fully comprehend DEFLATE myself, I'll try my best explaining the basics. DEFLATEd data consists of blocks, each with a 3-bit header: XXXX XYYZ. DEFLATE rather uses a bit stream for the Huffman-compressed (explained later) data, meaning that information within the compressed data isn't "byte-aligned".
Z = Is final block to process? 1 for final, 0 for more blocks to come
YY = Block command. 00 = direct copy, 01 = fixed Huffman tree, 10 = dynamic Huffman tree, 11 = unused/error.
Direct Copy
"The stored block option adds minimal overhead, and is used for data that is incompressible." ~ Wikipedia. Basically copy data directly to the output. Direct copy has the following header bytes: [inflate_command_bits][length][length][!length][!length]
Where "length" make a 16-bit value of the length of the data that should be copied, and "!length" is the inverse value. e.g. [0x1000][0xEFFF]
Fixed/Dynamic Huffman tree
Huffman code is a way to turn text into a bit stream, by assigning the most common letters with the least data-consuming bits representation, while assigning the least common letters with the most data-consuming bits representation. As a result, data gets compressed. For an explanation, visit this video. You can code data into bits. Decoded data from said bits are called "symbols". It's a very fascinating technique. This isn't limited to just text, however, but it's the least abstract example.
The difference between fixed/dynamic trees is that fixed blocks use a pre-defined Huffman tree inside the decompressor code (so we don't need to store a dictionary inside the compressed data), while the dynamic Huffman trees store the dictionary inside the compressed data following the block header.
The Huffman trees also have the capability of backreferencing. If the decoded symbol happens to be a value over 255, it means that the decompressor should copy an already decompressed block. This is useful for repeating data. Kind of like the "Repeat" command in LC_LZ2.
It is possible that compressed files solely consist out of a single block with a ton of backreferences. For example, GFX00.bin DEFLATEd is a single dynamic block with 299 backreferences.
If you saw the statistics from earlier, you probably saw how DEFLATE is a superior compression method but is quite slow. Therefore, we'd like help on two fronts:
- Optimize the code so it uses less cycles, thus decompressing files faster
- Make an SA-1 port. As DEFLATE uses a bit stream, you might want to look into SA-1 register $2258
To count clocks, use p4plus2's custom build of lsnes. link. These are the sources which build on Linux. I tried it myself using a virtual machine but as I'm a complete newbie to Linux I couldn't manage, so it was p4 who really did all the benchmarking. Maybe someone else can manage to put the following together and and point people to the right direction.
Note the STZ $5003/5005 in the code. These are special debugger registers p4plus2 added to lsnes. First STZ starts the timer and the second one stops it. Also, additionally: dump.lua
Also, opcode reference with cycle count
This brings back memories. Let's make a miracle happen again!
Let's get right into the good stuff.
The code
Inflate code (2018-07-11 10:54:40 AM @ 4753212 clocks, originally 7695346)
Compressed test files
inflate, before all optimization magic.
p4plus2 is the main contributor to the code while I assisted him with stuff like producing debug files and a bit of optimizing. He wrote the entire thing himself (it's not based off of the repo link from earlier). The code is useable (and impressive), but in terms of performance, it's still unfinished and you'll find out why as you read this post. At the moment it assembles into a homebrew ROM we use for testing. asar in the tools section should work just fine for this.
The code currently runs assuming there are no interrupts. When we/you port this to Super Mario World, keep this in mind. Although interrupts should be disabled during level load already.
What is DEFLATE?
DEFLATE is a 'popular' compression method. (Decompressing it is called "inflate"). It is popular in the sense of it being used even in modern-day compression, such as in gzip or PNGs. The usage is so widespread that certain languages have support for DEFLATE compression readily available. Interestingly enough, DEFLATE wasn't invented until a few years after the SNES was released. The format of the compression is discussed later.
Statistics
zopfli made by Google excels at compressing in DEFLATE, maximizing compression but being slow as a result (at least, when you tell it to be slow). I compared the current SMW compression methods with DEFLATE. Here are some legitimately exciting numbers.
Code
GFX FILE Raw LC_LZ2 LC_LZ3 DEFLATE(9999*) (LZ3 - DEFLATE) GFX00 0x1000 0x8B6 0x77C 0x667 0x115 GFX01 0x1000 0xC20 0xAE8 0x94E 0x19A GFX02 0x1000 0x96D 0x864 0x6F6 0x16E GFX03 0x1000 0xB8E 0xA39 0x8EB 0x14E GFX04 0x1000 0xB0E 0x9C6 0x848 0x17E GFX05 0x1000 0xB22 0x9E9 0x84E 0x19B GFX06 0x1000 0xA7E 0x97A 0x825 0x155 GFX07 0x1000 0xA33 0x911 0x799 0x178 GFX08 0x1000 0xA45 0x8F6 0x7BF 0x137 GFX09 0x1000 0xC27 0xAEF 0x8F6 0x1F9 GFX0A 0x1000 0xAED 0x907 0x831 0xD6 GFX0B 0x1000 0xA95 0x9A9 0x7D7 0x1D2 GFX0C 0x1000 0x89E 0x7B8 0x697 0x121 GFX0D 0x1000 0x7CC 0x6B8 0x5CD 0xEB GFX0E 0x1000 0xA38 0x903 0x788 0x17B GFX0F 0x1000 0x91C 0x7F4 0x666 0x18E GFX10 0x1000 0x96D 0x804 0x6D1 0x133 GFX11 0x1000 0x8E6 0x7FF 0x6F0 0x10F GFX12 0x1000 0xA5D 0x965 0x812 0x153 GFX13 0x1000 0xB3E 0x9F8 0x89C 0x15C GFX14 0x1000 0x745 0x62D 0x55D 0xD0 GFX15 0x1000 0x940 0x825 0x6A1 0x184 GFX16 0x1000 0x624 0x58A 0x4BA 0xD0 GFX17 0x1000 0x9CE 0x871 0x743 0x12E GFX18 0x1000 0xA2F 0x8EE 0x7E0 0x10E GFX19 0x1000 0x7A3 0x6D5 0x5C3 0x112 GFX1A 0x1000 0x95E 0x899 0x75E 0x13B GFX1B 0x1000 0x8B2 0x75A 0x623 0x137 GFX1C 0x1000 0x840 0x731 0x606 0x12B GFX1D 0x1000 0xB77 0xA8A 0x8C1 0x1C9 GFX1E 0x1000 0x91C 0x7FC 0x6D1 0x12B GFX1F 0x1000 0x895 0x7BA 0x6AB 0x10F GFX20 0x1000 0x9A2 0x8E2 0x732 0x1B0 GFX21 0x1000 0xA6B 0x962 0x7C1 0x1A1 GFX22 0x1000 0x975 0x85A 0x751 0x109 GFX23 0x1000 0x9A6 0x8A1 0x749 0x158 GFX24 0x1000 0x9BA 0x8D2 0x724 0x1AE GFX25 0x1000 0xBD3 0xAA3 0x89D 0x206 GFX26 0x1000 0xA01 0x851 0x7B8 0x99 GFX27 0xC00 0x8D5 0x854 0x6B0 0x1A4 GFX28 0x800 0x5F4 0x574 0x47E 0xF6 GFX29 0x800 0x4AB 0x439 0x381 0xB8 GFX2A 0x800 0x4C9 0x3F8 0x338 0xC0 GFX2B 0x800 0x6BE 0x655 0x546 0x10F GFX2C 0x1000 0x921 0x829 0x6D3 0x1EC GFX2D 0x1000 0xA5A 0x8E4 0x7C9 0x11B GFX2E 0x1000 0x8EA 0x7D8 0x65A 0x17E GFX2F 0x400 0x1C9 0x182 0x138 0x4A GFX30 0x800 0x524 0x4D1 0x3CE 0x103 GFX31 0x800 0x5D3 0x4EC 0x437 0xB5 GFX32 0x5D00 0x388F 0x329D 0x2AE2 0x7BB GFX33 0x3000 0x1C68 0x18F7 0x1548 0x3AF Total bytes saved using LZ3 rather than LZ2, for GFX files 0-33: 0x3CA4 Total bytes saved using DEFLATE rather than LZ3, for GFX files 0-33: 0x4A53 (and thus, 0x3CA4 + 0x4A53 = 0x86F7 bytes saved compared to LZ2) *zopfli has an option to iterate multiple times for even more compression, at the cost of time. Although the limit to the amount of iterations is 99999, increasing it after a certain threshold won't make any significant difference. This is evident when you use the verbose argument during compression.
As you can see, using DEFLATE saves over half a bank of space compared to LC_LZ3. That is quite incredible. You can notice that the compression really goes at it, if you open the compressed files in a tile editor; in DEFLATE you pretty much don't see any trace of the original graphics at all. Here's GFX00.bin:

However, compression isn't only limited to graphics. It could also be used to compress other assets which use a lot of space, particularly, custom music.
Code
Music compression. Using the -p argument of AddmusicK, we dumped the music into .bin files. These are the original SMW songs, inserted with AddmusicK's supplied music list, without any alterations whatsoever. MUSIC FILE Raw DEFLATE(9999*) (Raw - DEFLATE) Song name main.bin 0x26FC 0x1C28 0xAD4 (Is this the SPC engine?) music01 0x77 0x75 0x2 Miss.txt music02 0xA9 0x9C 0xD Game Over.txt music03 0x1CB 0x147 0x84 Boss Clear.txt music04 0x168 0x11E 0x4A Stage Clear.txt music05 0xBC 0x95 0x27 Starman.txt music06 0xB0 0x84 0x2C P-switch.txt music07 0x98 0x5B 0x3D Keyhole.txt music08 0x46 0x3D 0x9 Iris Out.txt music09 0xD9 0xA8 0x31 Bonus End.txt music0A 0x4E9 0x2A4 0x245 Piano.txt music0B 0x61A 0x375 0x2A5 Here We Go.txt music0C 0x476 0x249 0x22D Water.txt music0D 0x46F 0x2C8 0x1A7 Bowser.txt music0E 0x4F6 0x2B9 0x23D Boss.txt music0F 0x18A 0x127 0x63 Cave.txt music10 0x367 0x1F4 0x173 Ghost.txt music11 0x71A 0x40B 0x30F Castle.txt music12 0x29B 0x19E 0xFD Switch Palace.txt music13 0x271 0x193 0xDE Welcome.txt music14 0x99 0x63 0x36 Rescue Egg.txt music15 0x4E1 0x344 0x19D Title.txt music16 0x106 0xD4 0x32 Valley of Bowser Appears.txt music17 0x12B 0xE8 0x43 Overworld.txt music18 0x134 0xFC 0x38 Yoshi's Island.txt music19 0x1DA 0x163 0x77 Vanilla Dome.txt music1A 0x130 0x115 0x1B Star Road.txt music1B 0x155 0x117 0x3E Forest of Illusion.txt music1C 0x10C 0xC9 0x43 Valley of Bowser.txt music1D 0x822 0x391 0x491 Special World.txt music1E 0x39 0x31 0x8 IntroScreen.txt music1F 0x37F 0x22D 0x152 Bowser Scene 2.txt music20 0x37F 0x22D 0x152 Bowser Scene 3.txt music21 0x311 0x14B 0x1C6 Bowser Defeated.txt music22 0x119 0xE0 0x39 Bowser Interlude.txt music23 0x115 0xA0 0x75 Bowser Zoom In.txt music24 0x26C 0xFF 0x16D Bowser Zoom Out.txt music25 0x173 0x110 0x63 Princess Peach is Rescued.txt music26 0x1148 0x575 0xBD3 Staff Roll.txt music27 0x14D 0xCC 0x81 The Yoshis Are Home.txt music28 0x2022 0x78C 0x189C Cast List.txt Total bytes saved using DEFLATE for all default AddmusicK SMW music: 0x4766 (not counting main.bin) *zopfli has an option to iterate multiple times for even more compression, at the cost of time. Although the limit to the amount of iterations is 99999, increasing it after a certain threshold won't make any significant difference. This is evident when you use the verbose argument during compression.
Compressing all the original SMW music into DEFLATE format also saves over half a bank of space. A noticeable trend here is that as you compress bigger files, the compression gets more and more efficient. See: the difference between music01 and music28. However, to decompress songs you'll need a particularly large RAM buffer. GFX usually requires 0x1000 bytes per file, but music can exceed that. Then again, free RAM shouldn't be a problem during level load, but dynamically uploading songs to the ARAM during gameplay (is that a thing in present-day SMW hacking?) could cause problems. Either way, this is something that is very well within the realm of possibility.
For the sake of even more statistics, I also compressed the (optimized) SMW samples supplied with AddmusicK. This one is interesting, as samples technically speaking are already compressed (BRR). Here are the results:
Code
Sample compression. The compressed samples are the optimized ones supplied with AddmusicK. SAMPLE FILE Raw DEFLATE(9999*) (Raw - DEFLATE) 00 SMW @0 0x26 0x23 0x3 01 SMW @1 0x26 0x24 0x2 02 SMW @2 0x89 0x8A -1 03 SMW @3 0xB6 0xB7 -1 04 SMW @4 0x26 0x23 0x3 05 SMW @8 0x119 0x114 0x5 06 SMW @22 0x266 0x248 0x1E 07 SMW @5 0x3A1 0x37E 0x23 08 SMW @6 0x26 0x24 0x2 09 SMW @7 0x7A3 0x6CB 0xD8 0A SMW @9 0x49D 0x438 0x65 0B SMW @10 0x36B 0x352 0x19 0C SMW @13 0x26 0x23 0x3 0D SMW @14 0x185 0x183 0x2 0E SMW @29 0x599 0x54C 0x4D 0F SMW @21 0x122 0x120 0x2 10 SMW @12 0x43A 0x406 0x34 11 SMW @17 0x230 0x220 0x10 12 SMW @15 0x5F3 0x59B 0x58 13 SMW Thunder 0x953 0x8B9 0x9A Total bytes saved using DEFLATE for all default AddmusicK optimized samples: 0x32E
The space saved from compressing samples is negligible. The compressed files barely have a difference in size compared to the raw files. Furthermore, samples don't seem to be large to begin with, but custom samples could be a different story.
Finally, p4plus2 also benchmarked the performance of the current ASM decompression algorithms using GFX00.bin as a base. Here are the results:
Code
PERFORMANCE: Decompressing GFX00.bin
lz2 (original) 1638070 clocks
lz2 (optimized) 599616 clocks
lz3 1041172 clocks
DEFLATE 4753212 clocks
We tried aiming for at least "lz2 (original)" as it's SMW's original decompression routine. Loading times of levels shouldn't be too noticeable with that performance. However, we're still a long shot away from that milestone. We would like some help with it, if anyone is up to the challenge. More details at the end of the post.
The format
Although I don't fully comprehend DEFLATE myself, I'll try my best explaining the basics. DEFLATEd data consists of blocks, each with a 3-bit header: XXXX XYYZ. DEFLATE rather uses a bit stream for the Huffman-compressed (explained later) data, meaning that information within the compressed data isn't "byte-aligned".
Z = Is final block to process? 1 for final, 0 for more blocks to come
YY = Block command. 00 = direct copy, 01 = fixed Huffman tree, 10 = dynamic Huffman tree, 11 = unused/error.
Direct Copy
"The stored block option adds minimal overhead, and is used for data that is incompressible." ~ Wikipedia. Basically copy data directly to the output. Direct copy has the following header bytes: [inflate_command_bits][length][length][!length][!length]
Where "length" make a 16-bit value of the length of the data that should be copied, and "!length" is the inverse value. e.g. [0x1000][0xEFFF]
Fixed/Dynamic Huffman tree
Huffman code is a way to turn text into a bit stream, by assigning the most common letters with the least data-consuming bits representation, while assigning the least common letters with the most data-consuming bits representation. As a result, data gets compressed. For an explanation, visit this video. You can code data into bits. Decoded data from said bits are called "symbols". It's a very fascinating technique. This isn't limited to just text, however, but it's the least abstract example.
The difference between fixed/dynamic trees is that fixed blocks use a pre-defined Huffman tree inside the decompressor code (so we don't need to store a dictionary inside the compressed data), while the dynamic Huffman trees store the dictionary inside the compressed data following the block header.
The Huffman trees also have the capability of backreferencing. If the decoded symbol happens to be a value over 255, it means that the decompressor should copy an already decompressed block. This is useful for repeating data. Kind of like the "Repeat" command in LC_LZ2.
It is possible that compressed files solely consist out of a single block with a ton of backreferences. For example, GFX00.bin DEFLATEd is a single dynamic block with 299 backreferences.
We need help!
If you saw the statistics from earlier, you probably saw how DEFLATE is a superior compression method but is quite slow. Therefore, we'd like help on two fronts:
- Optimize the code so it uses less cycles, thus decompressing files faster
- Make an SA-1 port. As DEFLATE uses a bit stream, you might want to look into SA-1 register $2258
To count clocks, use p4plus2's custom build of lsnes. link. These are the sources which build on Linux. I tried it myself using a virtual machine but as I'm a complete newbie to Linux I couldn't manage, so it was p4 who really did all the benchmarking. Maybe someone else can manage to put the following together and and point people to the right direction.
Note the STZ $5003/5005 in the code. These are special debugger registers p4plus2 added to lsnes. First STZ starts the timer and the second one stops it. Also, additionally: dump.lua
Code
function on_frame_emulated() if movie.framecount() < 50 then return end local file = io.open("path/to/gfx00.dump.bin", "wb") if file then local tmp = memory2.WRAM:readregion(0x2000, 0x1000) for i=1,#tmp do file:write(string.char(tmp[i])) end file:close() end exec("quit-emulator") end
Quote
asar -nocheck inflate.asm homebrew.sfc && lsnes --rom=~/programming/snes/compression/inflate/homebrew.sfc --lua=~/programming/snes/compression/inflate/dump.lua --unpause && sha1sum GFX00.*
so in one command it compiles, runs for 50 frames, dumps the decompressed memory, exits the emulator, and computes the sha1 files for GFX00.bin and GFX00.dump.bin
so in one command it compiles, runs for 50 frames, dumps the decompressed memory, exits the emulator, and computes the sha1 files for GFX00.bin and GFX00.dump.bin
Also, opcode reference with cycle count
This brings back memories. Let's make a miracle happen again!