19 users online:  Anorakun, autisticsceptile1993, E-man38, Evernn, FireSeraphim, Green Jerry, Hammerer,  idol, Josuke Yoshikage, kaigem, Mr. Rudolph, Neuromancer, ryk Meurer, Samie Zuccati, Skye, Taffy, The_Kingslayer17, TheGelF3000, Ultima - Guests: 112 - Bots: 199
Users: 62,835 (2,626 active)
Latest user: TheGelF3000

Move over, LC_LZ3! DEFLATE for the SNES (WIP, ASM help is very welcome!)

A few years ago, I think p4plus2 let me know about the existence of a certain piece of 6502 code. Back then, I didn't understand so much about it and just shrugged it off as "complicated". Recently, I decided to revisit this and noticed that this is actually something very cool and could totally be implemented into SMW.

Let's get right into the good stuff.

The code

Inflate code (2018-07-11 10:54:40 AM @ 4753212 clocks, originally 7695346)
Compressed test files
inflate, before all optimization magic.

p4plus2 is the main contributor to the code while I assisted him with stuff like producing debug files and a bit of optimizing. He wrote the entire thing himself (it's not based off of the repo link from earlier). The code is useable (and impressive), but in terms of performance, it's still unfinished and you'll find out why as you read this post. At the moment it assembles into a homebrew ROM we use for testing. asar in the tools section should work just fine for this.

The code currently runs assuming there are no interrupts. When we/you port this to Super Mario World, keep this in mind. Although interrupts should be disabled during level load already.

What is DEFLATE?

DEFLATE is a 'popular' compression method. (Decompressing it is called "inflate"). It is popular in the sense of it being used even in modern-day compression, such as in gzip or PNGs. The usage is so widespread that certain languages have support for DEFLATE compression readily available. Interestingly enough, DEFLATE wasn't invented until a few years after the SNES was released. The format of the compression is discussed later.


zopfli made by Google excels at compressing in DEFLATE, maximizing compression but being slow as a result (at least, when you tell it to be slow). I compared the current SMW compression methods with DEFLATE. Here are some legitimately exciting numbers.


GFX00		0x1000		0x8B6		0x77C		0x667		0x115
GFX01		0x1000		0xC20		0xAE8		0x94E		0x19A
GFX02		0x1000		0x96D		0x864		0x6F6		0x16E
GFX03		0x1000		0xB8E		0xA39		0x8EB		0x14E
GFX04		0x1000		0xB0E		0x9C6		0x848		0x17E
GFX05		0x1000		0xB22		0x9E9		0x84E		0x19B
GFX06		0x1000		0xA7E		0x97A		0x825		0x155
GFX07		0x1000		0xA33		0x911		0x799		0x178
GFX08		0x1000		0xA45		0x8F6		0x7BF		0x137
GFX09		0x1000		0xC27		0xAEF		0x8F6		0x1F9
GFX0A		0x1000		0xAED		0x907		0x831		0xD6
GFX0B		0x1000		0xA95		0x9A9		0x7D7		0x1D2
GFX0C		0x1000		0x89E		0x7B8		0x697		0x121
GFX0D		0x1000		0x7CC		0x6B8		0x5CD		0xEB
GFX0E		0x1000		0xA38		0x903		0x788		0x17B
GFX0F		0x1000		0x91C		0x7F4		0x666		0x18E
GFX10		0x1000		0x96D		0x804		0x6D1		0x133
GFX11		0x1000		0x8E6		0x7FF		0x6F0		0x10F
GFX12		0x1000		0xA5D		0x965		0x812		0x153
GFX13		0x1000		0xB3E		0x9F8		0x89C		0x15C
GFX14		0x1000		0x745		0x62D		0x55D		0xD0
GFX15		0x1000		0x940		0x825		0x6A1		0x184
GFX16		0x1000		0x624		0x58A		0x4BA		0xD0
GFX17		0x1000		0x9CE		0x871		0x743		0x12E
GFX18		0x1000		0xA2F		0x8EE		0x7E0		0x10E
GFX19		0x1000		0x7A3		0x6D5		0x5C3		0x112
GFX1A		0x1000		0x95E		0x899		0x75E		0x13B
GFX1B		0x1000		0x8B2		0x75A		0x623		0x137
GFX1C		0x1000		0x840		0x731		0x606		0x12B
GFX1D		0x1000		0xB77		0xA8A		0x8C1		0x1C9
GFX1E		0x1000		0x91C		0x7FC		0x6D1		0x12B
GFX1F		0x1000		0x895		0x7BA		0x6AB		0x10F
GFX20		0x1000		0x9A2		0x8E2		0x732		0x1B0
GFX21		0x1000		0xA6B		0x962		0x7C1		0x1A1
GFX22		0x1000		0x975		0x85A		0x751		0x109
GFX23		0x1000		0x9A6		0x8A1		0x749		0x158
GFX24		0x1000		0x9BA		0x8D2		0x724		0x1AE
GFX25		0x1000		0xBD3		0xAA3		0x89D		0x206
GFX26		0x1000		0xA01		0x851		0x7B8		0x99
GFX27		0xC00		0x8D5		0x854		0x6B0		0x1A4
GFX28		0x800		0x5F4		0x574		0x47E		0xF6
GFX29		0x800		0x4AB		0x439		0x381		0xB8
GFX2A		0x800		0x4C9		0x3F8		0x338		0xC0
GFX2B		0x800		0x6BE		0x655		0x546		0x10F
GFX2C		0x1000		0x921		0x829		0x6D3		0x1EC
GFX2D		0x1000		0xA5A		0x8E4		0x7C9		0x11B
GFX2E		0x1000		0x8EA		0x7D8		0x65A		0x17E
GFX2F		0x400		0x1C9		0x182		0x138		0x4A
GFX30		0x800		0x524		0x4D1		0x3CE		0x103
GFX31		0x800		0x5D3		0x4EC		0x437		0xB5
GFX32		0x5D00		0x388F		0x329D		0x2AE2		0x7BB
GFX33		0x3000		0x1C68		0x18F7		0x1548		0x3AF

Total bytes saved using LZ3 rather than LZ2, for GFX files 0-33: 0x3CA4

Total bytes saved using DEFLATE rather than LZ3, for GFX files 0-33: 0x4A53
(and thus, 0x3CA4 + 0x4A53 = 0x86F7 bytes saved compared to LZ2)

*zopfli has an option to iterate multiple times for even more compression, at the cost of time.
Although the limit to the amount of iterations is 99999, increasing it after a certain threshold 
won't make any significant difference. This is evident when you use the verbose argument during 

As you can see, using DEFLATE saves over half a bank of space compared to LC_LZ3. That is quite incredible. You can notice that the compression really goes at it, if you open the compressed files in a tile editor; in DEFLATE you pretty much don't see any trace of the original graphics at all. Here's GFX00.bin:

However, compression isn't only limited to graphics. It could also be used to compress other assets which use a lot of space, particularly, custom music.

Music compression. Using the -p argument of AddmusicK, we dumped the music
into .bin files. These are the original SMW songs, inserted with AddmusicK's
supplied music list, without any alterations whatsoever.

MUSIC FILE	Raw		DEFLATE(9999*)	(Raw - DEFLATE)		Song name
main.bin	0x26FC		0x1C28		0xAD4			(Is this the SPC engine?)
music01		0x77		0x75		0x2			Miss.txt
music02		0xA9		0x9C		0xD			Game Over.txt
music03		0x1CB		0x147		0x84			Boss Clear.txt
music04		0x168		0x11E		0x4A			Stage Clear.txt
music05		0xBC		0x95		0x27			Starman.txt
music06		0xB0		0x84		0x2C			P-switch.txt
music07		0x98		0x5B		0x3D			Keyhole.txt
music08		0x46		0x3D		0x9			Iris Out.txt
music09		0xD9		0xA8		0x31			Bonus End.txt

music0A		0x4E9		0x2A4		0x245			Piano.txt
music0B		0x61A		0x375		0x2A5			Here We Go.txt
music0C		0x476		0x249		0x22D			Water.txt
music0D		0x46F		0x2C8		0x1A7			Bowser.txt
music0E		0x4F6		0x2B9		0x23D			Boss.txt
music0F		0x18A		0x127		0x63			Cave.txt
music10		0x367		0x1F4		0x173			Ghost.txt
music11		0x71A		0x40B		0x30F			Castle.txt
music12		0x29B		0x19E		0xFD			Switch Palace.txt
music13		0x271		0x193		0xDE			Welcome.txt
music14		0x99		0x63		0x36			Rescue Egg.txt
music15		0x4E1		0x344		0x19D			Title.txt
music16		0x106		0xD4		0x32			Valley of Bowser Appears.txt
music17		0x12B		0xE8		0x43			Overworld.txt
music18		0x134		0xFC		0x38			Yoshi's Island.txt
music19		0x1DA		0x163		0x77			Vanilla Dome.txt
music1A		0x130		0x115		0x1B			Star Road.txt
music1B		0x155		0x117		0x3E			Forest of Illusion.txt
music1C		0x10C		0xC9		0x43			Valley of Bowser.txt
music1D		0x822		0x391		0x491			Special World.txt
music1E		0x39		0x31		0x8			IntroScreen.txt
music1F		0x37F		0x22D		0x152			Bowser Scene 2.txt
music20		0x37F		0x22D		0x152			Bowser Scene 3.txt
music21		0x311		0x14B		0x1C6			Bowser Defeated.txt
music22		0x119		0xE0		0x39			Bowser Interlude.txt
music23		0x115		0xA0		0x75			Bowser Zoom In.txt
music24		0x26C		0xFF		0x16D			Bowser Zoom Out.txt
music25		0x173		0x110		0x63			Princess Peach is Rescued.txt
music26		0x1148		0x575		0xBD3			Staff Roll.txt
music27		0x14D		0xCC		0x81			The Yoshis Are Home.txt
music28		0x2022		0x78C		0x189C			Cast List.txt

Total bytes saved using DEFLATE for all default AddmusicK SMW music: 0x4766 (not counting main.bin)

*zopfli has an option to iterate multiple times for even more compression, at the cost of time.
Although the limit to the amount of iterations is 99999, increasing it after a certain threshold 
won't make any significant difference. This is evident when you use the verbose argument during 

Compressing all the original SMW music into DEFLATE format also saves over half a bank of space. A noticeable trend here is that as you compress bigger files, the compression gets more and more efficient. See: the difference between music01 and music28. However, to decompress songs you'll need a particularly large RAM buffer. GFX usually requires 0x1000 bytes per file, but music can exceed that. Then again, free RAM shouldn't be a problem during level load, but dynamically uploading songs to the ARAM during gameplay (is that a thing in present-day SMW hacking?) could cause problems. Either way, this is something that is very well within the realm of possibility.

For the sake of even more statistics, I also compressed the (optimized) SMW samples supplied with AddmusicK. This one is interesting, as samples technically speaking are already compressed (BRR). Here are the results:

Sample compression. The compressed samples are the optimized ones supplied
with AddmusicK.

00 SMW @0	0x26		0x23		0x3
01 SMW @1	0x26		0x24		0x2
02 SMW @2	0x89		0x8A		-1
03 SMW @3	0xB6		0xB7		-1
04 SMW @4	0x26		0x23		0x3
05 SMW @8	0x119		0x114		0x5
06 SMW @22	0x266		0x248		0x1E
07 SMW @5	0x3A1		0x37E		0x23
08 SMW @6	0x26		0x24		0x2
09 SMW @7	0x7A3		0x6CB		0xD8
0A SMW @9	0x49D		0x438		0x65
0B SMW @10	0x36B		0x352		0x19
0C SMW @13	0x26		0x23		0x3
0D SMW @14	0x185		0x183		0x2
0E SMW @29	0x599		0x54C		0x4D
0F SMW @21	0x122		0x120		0x2
10 SMW @12	0x43A		0x406		0x34
11 SMW @17	0x230		0x220		0x10
12 SMW @15	0x5F3		0x59B		0x58
13 SMW Thunder	0x953		0x8B9		0x9A

Total bytes saved using DEFLATE for all default AddmusicK optimized samples: 0x32E

The space saved from compressing samples is negligible. The compressed files barely have a difference in size compared to the raw files. Furthermore, samples don't seem to be large to begin with, but custom samples could be a different story.

Finally, p4plus2 also benchmarked the performance of the current ASM decompression algorithms using GFX00.bin as a base. Here are the results:
PERFORMANCE: Decompressing GFX00.bin
lz2 (original)	1638070 clocks
lz2 (optimized)	599616 clocks
lz3		1041172 clocks
DEFLATE 	4753212 clocks

We tried aiming for at least "lz2 (original)" as it's SMW's original decompression routine. Loading times of levels shouldn't be too noticeable with that performance. However, we're still a long shot away from that milestone. We would like some help with it, if anyone is up to the challenge. More details at the end of the post.

The format

Although I don't fully comprehend DEFLATE myself, I'll try my best explaining the basics. DEFLATEd data consists of blocks, each with a 3-bit header: XXXX XYYZ. DEFLATE rather uses a bit stream for the Huffman-compressed (explained later) data, meaning that information within the compressed data isn't "byte-aligned".

Z = Is final block to process? 1 for final, 0 for more blocks to come
YY = Block command. 00 = direct copy, 01 = fixed Huffman tree, 10 = dynamic Huffman tree, 11 = unused/error.

Direct Copy
"The stored block option adds minimal overhead, and is used for data that is incompressible." ~ Wikipedia. Basically copy data directly to the output. Direct copy has the following header bytes: [inflate_command_bits][length][length][!length][!length]
Where "length" make a 16-bit value of the length of the data that should be copied, and "!length" is the inverse value. e.g. [0x1000][0xEFFF]

Fixed/Dynamic Huffman tree
Huffman code is a way to turn text into a bit stream, by assigning the most common letters with the least data-consuming bits representation, while assigning the least common letters with the most data-consuming bits representation. As a result, data gets compressed. For an explanation, visit this video. You can code data into bits. Decoded data from said bits are called "symbols". It's a very fascinating technique. This isn't limited to just text, however, but it's the least abstract example.

The difference between fixed/dynamic trees is that fixed blocks use a pre-defined Huffman tree inside the decompressor code (so we don't need to store a dictionary inside the compressed data), while the dynamic Huffman trees store the dictionary inside the compressed data following the block header.

The Huffman trees also have the capability of backreferencing. If the decoded symbol happens to be a value over 255, it means that the decompressor should copy an already decompressed block. This is useful for repeating data. Kind of like the "Repeat" command in LC_LZ2.

It is possible that compressed files solely consist out of a single block with a ton of backreferences. For example, GFX00.bin DEFLATEd is a single dynamic block with 299 backreferences.

We need help!

If you saw the statistics from earlier, you probably saw how DEFLATE is a superior compression method but is quite slow. Therefore, we'd like help on two fronts:

- Optimize the code so it uses less cycles, thus decompressing files faster
- Make an SA-1 port. As DEFLATE uses a bit stream, you might want to look into SA-1 register $2258

To count clocks, use p4plus2's custom build of lsnes. link. These are the sources which build on Linux. I tried it myself using a virtual machine but as I'm a complete newbie to Linux I couldn't manage, so it was p4 who really did all the benchmarking. Maybe someone else can manage to put the following together and and point people to the right direction.

Note the STZ $5003/5005 in the code. These are special debugger registers p4plus2 added to lsnes. First STZ starts the timer and the second one stops it. Also, additionally: dump.lua
function on_frame_emulated()
    if movie.framecount() < 50 then
	local file ="path/to/gfx00.dump.bin", "wb") 

	if file then 
		local tmp = memory2.WRAM:readregion(0x2000, 0x1000)
		for i=1,#tmp do

asar -nocheck inflate.asm homebrew.sfc && lsnes --rom=~/programming/snes/compression/inflate/homebrew.sfc --lua=~/programming/snes/compression/inflate/dump.lua --unpause && sha1sum GFX00.*
so in one command it compiles, runs for 50 frames, dumps the decompressed memory, exits the emulator, and computes the sha1 files for GFX00.bin and GFX00.dump.bin

Also, opcode reference with cycle count

This brings back memories. Let's make a miracle happen again!
The Hacking of PuyoPuyo. Come join the fun, friends. 『いけいけ団長、頑張れ頑張れ団長!』
Help us raise funds for the Armed Forces of Ukraine. #ДопомагаємоРазом / #HelpTogether
“Even if you personally are so dissatisfied with life that you want the world to end, surely the cruel reality is that it will continue on, unchanging. All the better for someone perfectly content, like me.”
Aya Shameimaru, Touhou Suzunaan ~ Forbidden Scrollery
Has science gone too far? Holy crap this is impressive. Maybe someday in the distant future SMW hacking can get much more, custom on a large scale. Great, FANTASTIC work guys!
Currently working on:
Super Kekcroc World 4 (lol jk)
Originally posted by wiiqwertyuiop
What are you using to count the clocks?

Oh yeah, good one. Updated the final paragraph with additional information.

Has science gone too far?

Not until you have seen inline_decode_symbol!
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
As expected, this is indeed quite impressive. If you manage to get it reasonably fast, this could be quite the leap for SMW hacking and allow hacks to get much bigger (thinking of SMWCP2, for example, which I think is still struggling because of a lack of free space in the ROM). Even without optimization and without application in SMW hacks, there is still quite some potential here for the homebrew scene. I'm definitely curious to see where this will go. I can't offer much help myself, have never been a hardware expert or an expert at performance optimization, plus, I might legitimately die if I took yet another side project, haha xd, but I know there's a lot of talented SNES programmers around and I'd be surprised if none of them showed any interest in this.
Feel free to visit my website/blog - it's updated rarely, but it looks pretty cool!
Whoa, this is really neat. Though, I wonder if it would be feasible to write a realtime decompression routine for this, eliminating the need for a large RAM buffer. I'm thinking that could come in handy for saving a ton of ARAM space for custom music (not sure if the SPC700 would be able to handle it though).

Either way, this could definitely save a ton of ROM space and it sounds incredibly useful. I hope more optimizations are done on this in the future. #smw{:TUP:}
This is really, really amazing. I can already see in my mind people using many GFX files in their ROM hacks to make them more detailed. It would be cool if this gets implemented in Lunar Magic as an option in the future. Good job.

I would like to help but it seems very complicated for my ASM level.

Userbar by Green Jerry

Also a Fortaleza Reznor user. If you... digo, si hablas español, hackeas, buscas ayuda, o simplemente se te da conocer gente, únete, somos puerta abierta.
Originally posted by Jimmy
Whoa, this is really neat. Though, I wonder if it would be feasible to write a realtime decompression routine for this, eliminating the need for a large RAM buffer. I'm thinking that could come in handy for saving a ton of ARAM space for custom music (not sure if the SPC700 would be able to handle it though).

Realtime would be nice. Personally I'm not sure if it's possible, but I think the fastest way would be having SA-1 decompress it then upload it to ARAM, rather than making the SPC-700 decompress it.

Originally posted by RPG Hacker
As expected, this is indeed quite impressive. If you manage to get it reasonably fast, this could be quite the leap for SMW hacking and allow hacks to get much bigger (thinking of SMWCP2, for example, which I think is still struggling because of a lack of free space in the ROM). Even without optimization and without application in SMW hacks, there is still quite some potential here for the homebrew scene.

Indeed. The code is written in such a way that it can be ported over to pretty much anywhere. Imagine using this on Super Mario All-Stars. All the graphics there are uncompressed and it would save so much space but haha who am i kidding, i haven't made any progress with smas for months but anyway i digress

I do remember ROM space being an issue in one collab, which is one of the reasons what got me looking at this DEFLATE project again. Seeing how it beats LC_LZ3 really makes me happy, but if we can't find many optimization points it might become an "SA-1 recommended" type of feature.

I will be updating the main post's code link with the time and the amount of clocks it currently has as we keep optimizing it. p4 and I are still working on this as we speak.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
Double post, but we reached a milestone!

When we first started this project and p4 wrote the initial code, we had a clock count of 7695346. After hard work and heavy wizardry, we've dropped the clock count to 4753212. It's less than the 6502 code which we also optimized to 4787552 clocks. That's a difference of 34340 clocks. That is also our goal (I mean, for a complicated format like this, I think that's fine). However, we can't find any improvement points anymore.

p4 showed the code to some TASbot veterans (who also have experience with optimizing SNES code) and they say that they can't find improvement points either.

Before we reach out to FuSoYa about this decompression routine we'd also like to see an SA-1 version of this. If it is any useful, here is the initial version of our code before all the optimization magic. inflate, initial version. Will be on first post also.

Bonus: Someone could also make a SuperFX version of this. It could be a fun project, but Yoshi's Island's hacking scene could totally benefit from this, especially because SuperFX ROMs to my knowledge are pretty restrictive in size.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
Neat! I've often thought about popping back onto the scene to write something like this. My biggest blocker earlier on was not being able to find intuitive (yet complete) documentation.

One way we could still improve runtimes is to restructure the algorithm to have 8 different code blocks per part of its functionality, and which one gets used depends on where we are w.r.t. byte alignment. Then, instead of dispensing bits individually or keeping track of the byte alignment offset explicitly, each block is just hardcoded with the jumps/branches to the correct blocks for other parts of the algorithm's functionality.

For example, the block for header reading at bit alignment offset 2 would then branch/jump to the offset-4 block for whatever functionality applied to the following data.

Each block can then be individually optimized (and tested!). I used this strategy in a simpler assembly decompression project with success. The downside is, of course, larger code size. But for the savings it could be worth it.

Regardless, I would say we should remember to store the final 16-bit decompressed size to $8D-$8E. This wasn't in original LZ2 code, but was added to optimized LZ2 and to LZ3. A handful of patches depend on this.
p4plus2 - Today at 12:15 PM
I thought about something like that, but due to how decode_symbol works I am not sure how useful it would be
Ersanio - Today at 12:16 PM
so it's not optimal unless you pretty much start rewriting things?
p4plus2 - Today at 12:17 PM
well decode_symbol works 1 bit at a time
and the symbol lengths are dynamic and decoded as you read a bit
Ersanio - Today at 12:17 PM
p4plus2 - Today at 1:00 PM
also to be fair I hard coded several of the bit readers if you look carefully
when they were constants
Ersanio - Today at 1:03 PM
so basically it comes down to that edit's suggestion will be a lot of work for possibly less savings?
p4plus2 - Today at 1:04 PM
I think so
and unlike LZ2 things are never byte aligned really....

So yeah. I think that unless we get an SA-1 or a SuperFX version, we'll stick to this performance.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
I was considering creating a SuperFX version of DEFLATE mainly for fun but also thinking that maybe it could help with YI hacking. I ran a bunch of tests to see if this is really worth it and got the following results: link
Some results are N/A as Lunar Compress refuses to compress larger/tiny-er files.

The link contains a comparison of all GFX files compressed in YI in both LC_LZ1 and LC_LZ16 as well as DEFLATE. I extracted all the GFX files using Ycompress, then recompressed them using Lunar Compress, in LC_LZ1 and LC_LZ16 format. I don't know which GFX files are compressed in which format in the original Yoshi's Island, so I just compressed them all in both formats.

Ultimately, the results are:
Bytes saved when using DEFLATE rather than LZ1 (assuming all files are compressed this way): 178472
Bytes saved when using DEFLATE rather than LZ16 (assuming all files are compressed this way): 76689 (not counting N/A files)

edit: for fun, I ran the worst case scenario compression, picking the worst compression delta value from the final 2 columns. GFX001 I picked -762, GFX002 I picked 311, etc. In the worst case scenario you save -2817 bytes.

Those numbers are immensely high, yet I have mixed feelings about LC_LZ16. In the file, you can see that most files are actually compressed better in LC_LZ16 compared to DEFLATE. I checked Raidenthequick's YI disassembly to see what the format is like and if I'm not mistaken, apparently it uses PLOT instructions to draw the decompressed graphics, which is actually quite interesting.

According to my data, there is still space saved, but the data of LC_LZ16 might be wrong as there is a "Format2" parameter of LC_LZ16. Here is Lunar Compress' definition of LC_LZ16:
LC_LZ16   SMW2/YI [G]     A format decompressed by the SFX chip.  You must
                           set Format2 to the number of 8x8 tile rows that
                           you want... setting it higher than the number of
                           actual rows in the compressed structure may
                           cause decompression to fail.

Setting a compression format of "0 rows" automatically causes decompression to fail, yet the the files do get compressed anyway. Therefore, I ran some tests with some of the files, GFX00C.bin (2bpp, 16 rows), GFX1C6.bin (4bpp, 8 rows), GFX046.bin (4bpp, 8 rows) and GFX00B.bin (4bpp, 16 rows)

Format2 = 0 = 0x967 bytes output, but fails decompressing
Format2 = 8 = 0x967 bytes output
Format2 = 16 = 0x989 bytes output

Format2 = 0 = 0x71D bytes output, but fails decompressing
Format2 = 8 = 0x71D bytes output

Format2 = 0 = 0x697 bytes output, but fails decompressing
Format2 = 8 = 0x697 bytes output

Format2 = 0 = 0x475 bytes output, but fails decompressing
Format2 = 16 = 0x475 bytes output

From this test I can conclude that Lunar Compress counts the amount of 8x8 rows in 4bpp format whenever it compresses with Format2 parameter "0". This means that it can't differentiate between 4bpp and 2bpp files, so the 2bpp compressed filesizes are somewhat inaccurate.

Finally, it is worth mentioning that for DEFLATE I used an iteration count of 999 this time rather than 9999 (because compressing 473 files with 9999 iterations each would take not one but two eternities to finish). I specifically recompressed GFX1C6.bin with 9999 iterations to see the filesize difference with 999 iterations:

999 iterations = 0x920 bytes compressed
9999 = 0x91F bytes compressed

It's interesting to see LC_LZ16 beat DEFLATE on many of the GFX files but ultimately, DEFLATE wins.

Conclusion: Is it worth to port DEFLATE to SuperFX? The answer is yes, with a but. Compressing the current graphics could save several banks of space, depending on which files use LC_LZ1 and LC_LZ16. It'd be nice to know which GFX files use which compression so we can calculate the actual saved space.

If YI hacking has/will get ExGFX support, it may be worth looking into whether they should be compressed in LC_LZ16 instead of DEFLATE as LC_LZ16 seems to beat DEFLATE numerous times.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
is DEFLATE faster than LC_LZ3?? how many cycles needs snes to decompress an average 4bpp gfx with DEFLATE compared with LC_LZ3?


SMWControlLibX GitHub
My Discord Server
We compared the speed using GFX00.bin as a base. DEFLATE takes about 4.5 times longer compared to LC_LZ3, which is why it'd be awesome if we had an SA-1 version. The exact numbers are on the first post.

I think at the current performance, decompressing all (Ex)GFX for a level would take about 2 seconds.
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
How about to use a better algorithm?

here is a paper with compression comparison, maybe it can be helpful .

Paper here

Maybe a good choice is brotli:9 because similar decompression speed (a little faster than DEFLATE) and very good compression ratio(better than deflate Compression ratio).

Also could be a good option LZ4 because is one of the fastest of all compression algorithms.

LZ4 Comparison

How Works LZ4


SMWControlLibX GitHub
My Discord Server
Tried both Brotli and LZ4 on GFX00.bin and the short story is that DEFLATE is still a better method.

DEFLATE compresses GFX00.bin into 1640 bytes.
Brotli compresses GFX00.bin into 1651 bytes.
LZ4 compresses GFX00.bin into 2236 bytes.

DEFLATE is the superior compression algorithm here. LZ4 might be faster but speed won't be an issue if we make an SA-1 or SuperFX implementation for DEFLATE. Finally, deciphering Brotli is going to be a nightmare we would rather not get involved with.

We could go really crazy if we want to, but for now, we are more than satisfied with DEFLATE!
My blog. I could post stuff now and then

My Assembly for the SNES tutorial (it's actually finished now!)
Well if you test it then DEFLATE is the best option for SA-1 or Super FX hacks.

Now is to hard to implement LZ4 decomppresion on snes?? because for normal hacks that doesnt use SA-1 or Super FX, could be great have something faster than LZ2.


SMWControlLibX GitHub
My Discord Server