I'm pretty sure this can be optimized even more, but I've hit my limit and have no idea how. I'd prefer advanced ASM hackers to contribute to this optimization, but other people can try too.
By optimizing I mean faster code, even if it costs ROM space: the code should use as few cycles as possible. It's all for the sake of decreasing level loading times and whatnot. It would be awesome if we actually saw visibly faster loading.
I just added in a little piece of code that stores the Y value (which contains the size of the decompressed ExGFX file) to $8D/$8E before the high byte gets destroyed by the SEP #$10 at $00B8EA.
This is useful for patches that need to upload arbitrary-sized ExGFX files without having to include a full copy of this routine in their code just to get the decompressed size.
$8D/$8E is already overwritten many times in the decompression routine and doesn't contain any useful information afterwards, so it's a good address to use.
EDIT: I'm also still leaning towards coding a GZIP decompression routine for SMW and implementing it into LM - not necessarily for faster decompression speeds, but because GZIP can compress to decently smaller sizes than LC_LZ2 in some cases.
GZIP is a pretty standardized compression format, and I've already found quite a few documents on it; I just haven't gotten around to coding a 65c816 version of the decompressor.
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
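macro ReadByte()
LDA [$8A] ; read the next compressed byte through the 24-bit pointer at $8A-$8C
LDX $8A ; 16-bit source address
INX
BNE + ; no wrap past $FFFF: still in the same bank
LDX.w #$8000 ; wrapped: LoROM data continues at $8000...
INC $8C ; ...of the next bank
+ STX $8A
endmacro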
org !Freespace
Decomp:
.Return PLB
STY $8D ; store size to $8D
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
.Loop %ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
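BEQ .Label3 ; command 0: direct copy
BPL .NextUp ; bit 7 clear: commands 1-3 (the fills)
.Label4 %ReadByte() ; bit 7 set: repeat; read the big-endian output offset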
XBA
%ReadByte()
TAX
REP #$20
LSR $8D
LDA $8D
BEQ .LoopEnd0
- PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
JMP .Loop
+ PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY
JMP .Loop
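.NextUp ASL A ; move command bit 6 into the sign flag
BPL .Label5 ; command 1: byte fill
ASL A ; move command bit 5 into the sign flag
BPL .Label6 ; command 2: word fill
%ReadByte() ; command 3: increasing fill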
LDX $8D
- STA ($00),y
INC A
INY
DEX
BNE -
JMP .Loop
.Label3 %ReadByte()
STA ($00),y
INY
LDX $8D
DEX
STX $8D
BNE .Label3
JMP .Loop
.Label5 %ReadByte()
LDX $8D
- STA ($00),y
INY
DEX
BNE -
JMP .Loop
.Label6 %ReadByte()
XBA
%ReadByte()
XBA
REP #$20
LSR $8D
LDX $8D
BEQ .LoopEnd
- STA ($00),y
INY #2
DEX
BNE -
.LoopEnd SEP #$20
BCS +
JMP .Loop
+ STA ($00),y
INY
JMP .Loop
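For anyone who wants to check an optimized version against known-good output, here is a rough Python model of what the routine above appears to do. It is only a sketch read off the ASM, not anything official: the function name is made up, it copies byte by byte, and it treats every command with the top bit set as a plain repeat, the same way the BPL .NextUp test does.

```python
def lc_lz2_decompress(data: bytes) -> bytes:
    """Byte-by-byte model of the decompression loop (hypothetical helper)."""
    out = bytearray()
    pos = 0
    while True:
        header = data[pos]
        pos += 1
        if header == 0xFF:                      # end marker (CMP #$FF / BEQ .Return)
            return bytes(out)
        if header & 0xE0 == 0xE0:               # long form: 111CCCLL LLLLLLLL
            cmd = (header >> 2) & 0x07
            length = ((header & 0x03) << 8 | data[pos]) + 1
            pos += 1
        else:                                   # short form: CCCLLLLL
            cmd = header >> 5
            length = (header & 0x1F) + 1
        if cmd == 0:                            # direct copy (.Label3)
            out += data[pos:pos + length]
            pos += length
        elif cmd == 1:                          # byte fill (.Label5)
            out += bytes([data[pos]]) * length
            pos += 1
        elif cmd == 2:                          # word fill (.Label6)
            pair = data[pos:pos + 2]
            pos += 2
            out += bytes(pair[i & 1] for i in range(length))
        elif cmd == 3:                          # increasing fill
            start = data[pos]
            pos += 1
            out += bytes((start + i) & 0xFF for i in range(length))
        else:                                   # top bit set: repeat (.Label4)
            src = data[pos] << 8 | data[pos + 1]  # big-endian offset into the output
            pos += 2
            for i in range(length):
                out.append(out[src + i])
```

The length of the returned buffer corresponds to the decompressed size that the patch stores to $8D/$8E.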
Got rid of the indirect addressing because absolute indexed is faster, plus you can use X, so there's no expensive shuffling in and out of Y. Your size value should be preserved too. Edit:
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
macro ReadByte()
LDA [$8A]
LDX $8A
INX
BNE +
LDX.w #$8000
INC $8C
+ STX $8A
endmacro
org !Freespace
Decomp:
.Return PLB
REP #$20
TYA
SEC
SBC $00 ;sub starting pointer
STA $8D ; store size to $8D
SEP #$20
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
LDY $00 ;16bit pointer in Y
.Loop %ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
BEQ .Label3
BPL .NextUp
.Label4 %ReadByte()
XBA
%ReadByte()
REP #$21
ADC $00 ;X needs to be offset by original pointer
TAX
LSR $8D
LDA $8D
BEQ .LoopEnd0
- LDA $0000,x
STA $0000,y
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
JMP .Loop
+ LDA $0000,x
STA $0000,y
INY
JMP .Loop
.NextUp ASL A
BPL .Label5
ASL A
BPL .Label6
%ReadByte()
LDX $8D
- STA $0000,y
INC A
INY
DEX
BNE -
JMP .Loop
.Label3 %ReadByte()
STA $0000,y
INY
LDX $8D
DEX
STX $8D
BNE .Label3
JMP .Loop
.Label5 %ReadByte()
LDX $8D
- STA $0000,y
INY
DEX
BNE -
JMP .Loop
.Label6 %ReadByte()
XBA
%ReadByte()
XBA
REP #$20
LSR $8D
LDX $8D
BEQ .LoopEnd
- STA $0000,y
INY #2
DEX
BNE -
.LoopEnd SEP #$20
BCS +
JMP .Loop
+ STA $0000,y
INY
JMP .Loop
Will contribute more when I'm not so tired. Pardon any stupid mistakes, but they should be easy to spot and correct.
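One thing that may be worth checking in the word-at-a-time repeat loop: if the compressed data ever contains a repeat whose source overlaps its destination, a 16-bit copy will not give the same bytes as a byte-by-byte copy. Whether Lunar Magic's compressor ever emits such repeats is an assumption here, not something confirmed in this thread. The Python sketch below only models the two loops (both function names are made up):

```python
def repeat_bytewise(buf: bytearray, src: int, dst: int, length: int) -> None:
    # One byte at a time: a byte written earlier can be read again later.
    for i in range(length):
        buf[dst + i] = buf[src + i]

def repeat_wordwise(buf: bytearray, src: int, dst: int, length: int) -> None:
    # Models LDA $0000,x / STA $0000,y with 16-bit A: both bytes are read
    # before either is written, then one 8-bit copy handles an odd length.
    for i in range(0, length - 1, 2):
        lo, hi = buf[src + i], buf[src + i + 1]
        buf[dst + i], buf[dst + i + 1] = lo, hi
    if length & 1:
        buf[dst + length - 1] = buf[src + length - 1]

a = bytearray(b"\xAA" + bytes(8))
b = bytearray(a)
repeat_bytewise(a, 0, 1, 4)   # source is one byte behind the destination
repeat_wordwise(b, 0, 1, 4)
print(a.hex(), b.hex())       # the two buffers differ in this overlapping case
```

When source and destination do not overlap, both functions produce the same result, so the word loop is only a concern if overlapping repeats can occur.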
If [$00] points to $12:FFFF for example it will read upper byte from $13:0000 instead of $13:8000.
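A small Python model of that bank-crossing case, in case it helps when testing. The dictionary standing in for ROM and both function names are made up for illustration; the only rule modeled is that in LoROM the byte after bank:$FFFF is (bank+1):$8000.

```python
def lorom_next(bank: int, addr: int) -> tuple[int, int]:
    """Address of the byte that follows (bank, addr) in LoROM data."""
    addr += 1
    if addr > 0xFFFF:
        return bank + 1, 0x8000   # wrap into the next bank at $8000, not $0000
    return bank, addr

def read_word(mem: dict, bank: int, addr: int) -> int:
    """Little-endian 16-bit read that steps across a bank boundary correctly."""
    lo = mem[(bank, addr)]
    hi = mem[lorom_next(bank, addr)]
    return hi << 8 | lo

# Toy memory: the low byte sits at $12:FFFF, the real high byte at $13:8000.
mem = {(0x12, 0xFFFF): 0x34, (0x13, 0x8000): 0x12, (0x13, 0x0000): 0x99}
print(hex(read_word(mem, 0x12, 0xFFFF)))
```

A plain 16-bit read would pick up the byte at $13:0000 (0x99 in the toy memory) as the high byte instead.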
Code
macro ReadWord()
LDA [$00],y
INY #2
BMI +
PHP
LDY #$8000
SEP #$20 ;is $03 used for anything? can save a cycle without the SEP
INC $02
PLP
BEQ +
XBA ;it's the high byte that got affected
SEP #$20
LDA [$00],Y
XBA
INY
REP #$20
+
endmacro
Didn't test that, so point out anything that doesn't look quite right, of course =)
Edit: since push/pull is actually slower than STA dp/LDA dp, some savings can be made by keeping those values in DP space that the routine doesn't otherwise use.
@ersanio: the unrollable stuff will be good to convert to DMA, except word fill, which would be pretty awkward. The copy and byte-fill loops are suitable, though.
@Min: Self-modifying code with MVN is better than load/store, but DMA WRAM->SRAM->WRAM is 2 cycles/byte, not counting setup. A few days ago there was a chat about just using DMA (which would also work for byte copy).
@ersanio/others: For byte fill you can just use DMA and set it to not increment the source address, so it reads the same byte every time. That's 1 cycle per byte transferred to $2180. You just have to store the byte somewhere in SRAM, because WRAM->WRAM transfers are not allowed.
Copy can do $2180->$70xxxx->$2180, but some SRAM must be reserved. Much faster, but more awkward to use. MVN is easy, so it depends on what we decide. If we ever have to copy something like 200 bytes (a large single-colored area, a repeating tile sequence, etc.) then DMA will be massively faster than MVN. For small quantities, MVN is OK because of the DMA setup time.
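To get a feel for where the break-even point might sit, here is a rough cost model in Python. The per-byte figures are the standard ones (MVN takes 7 CPU cycles per byte, DMA takes 8 master cycles per byte), but the DMA setup cost and the average length of a CPU cycle are placeholder assumptions, MVN's own small setup is ignored, and all the names are made up.

```python
MVN_CPU_CYCLES_PER_BYTE = 7       # block move cost per byte
DMA_MASTER_CYCLES_PER_BYTE = 8    # general-purpose DMA cost per byte

def mvn_cost(n: int, master_per_cpu_cycle: int = 8) -> int:
    """Master cycles for an n-byte MVN (assumes ~8 master cycles per CPU cycle)."""
    return n * MVN_CPU_CYCLES_PER_BYTE * master_per_cpu_cycle

def dma_cost(n: int, passes: int, setup_per_pass: int) -> int:
    """Master cycles for n bytes; passes=2 models the copy through SRAM."""
    return passes * (setup_per_pass + n * DMA_MASTER_CYCLES_PER_BYTE)

def break_even(passes: int, setup_per_pass: int, master_per_cpu_cycle: int = 8) -> int:
    """Smallest byte count at which DMA becomes cheaper than MVN."""
    n = 1
    while dma_cost(n, passes, setup_per_pass) >= mvn_cost(n, master_per_cpu_cycle):
        n += 1
    return n

# setup_per_pass=300 is a guess at the register-setup cost, not a measurement.
print(break_even(passes=2, setup_per_pass=300))  # copy through SRAM
print(break_even(passes=1, setup_per_pass=300))  # byte fill
```

Plugging in a measured setup cost for the actual register writes would show how long a copy or fill has to be before DMA pays off.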
By the end when everyone has put in their contribution it should be much, much faster than the original.
@Japan guys: if there is anyone else to bring please bring them since the communities are kind of divided for whatever reason. Even if the English is not too good it is still very easy to contribute to projects like these.
Re-added the size to $8D thing. I had to move one block of code to prevent it from causing out-of-range branches.
It will be very helpful if this stays in there.
Also added a RATS tag and an insert-size counter (current size is 351 bytes).
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL. I hate to keep going back to SMAS, but I do intend on using this routine in more than one spot. Point is, it can happen in SMW too, so I'd say the convenience of keeping the RTL outweighs the microsecond or so we would save.
Just my opinion. Anyone can rebut.
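A back-of-the-envelope check of that overhead, sketched in Python. JSL is 8 CPU cycles and RTL is 6; the NTSC master clock is about 21.477 MHz, and a CPU cycle lasts between 6 and 12 master cycles depending on what is being accessed, so the 6 and 8 used below are assumptions about the average. The function name is made up.

```python
MASTER_HZ = 21_477_272            # NTSC master clock
JSL_CYCLES, RTL_CYCLES = 8, 6     # CPU cycles per instruction

def call_overhead_us(calls: int, master_per_cpu_cycle: int) -> float:
    """Total JSL+RTL overhead in microseconds for the given number of calls."""
    cpu_cycles = calls * (JSL_CYCLES + RTL_CYCLES)
    return cpu_cycles * master_per_cpu_cycle / MASTER_HZ * 1e6

for speed in (6, 8):              # assumed average master cycles per CPU cycle
    print(speed, round(call_overhead_us(20, speed), 1), "us for 20 calls")
```

By this estimate the total for 20 calls is nearer 100 microseconds than 1, which is still well under 1% of a single frame (roughly 16.7 ms), so the argument for keeping the RTL holds either way.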
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL.
Agreeing with this. The difference is literally impossible to notice, and dropping the RTL just causes inconvenience, so there's little reason to do it. The focus should be on optimizing the loop content.
For some reason the DMA version is screwing up the new (yet-to-be-released) Layer3ExGFX, which reloads GFX on submap change (including FG slots) without fblank. It causes layer 1 and sprites to "flicker" above the windowing effects. I think it might have something to do with the DMA transfer that uses channel 6, because when I comment out the STA $420B the flickering stops (though of course that messes up the GFX load too), but changing the channel doesn't help at all.
I'm looking into a solution right now, because I really think this should be addressed.
EDIT: I just tested it in SNES9x, and the problem isn't exactly the same, but it is there: the whole screen flickers black as if entering fblank every other frame.
I guess the routine has a problem running outside of blanking, which is a problem for any "OW ExGraFix" patches that reload GFX on submap change without fblank.
EDIT2: BSNES does the same.
EDIT3: I'm not even sure if the DMA version is faster. And if it is slightly faster, I think we should continue to use the non-DMA version because compatibility > tiny speed increases.
@edit: It writes to the CH5/CH6 registers; any possibility CH5 is messing things up for whatever you are doing? The registers themselves are readable, so maybe push them before entering the routine and see if it helps.