Optimize the LC_LZ2 decompression!
Lately I've been attempting to optimize the LC_LZ2 decompression routine of SMW. It went perfectly fine until I hit my limit.

Why am I optimizing this? To decrease the level loading time, even if only by a split second. My code currently looks like this:

Code
HEADER
LOROM
!Freespace = $1D8000
ORG $00B8E3
    JML Decomp_start
macro ReadByte()
    LDA [$8A]
    LDX $8A
    INX
    BNE +
    LDX.w #$8000
    INC $8C
+   STX $8A
endmacro
ORG !Freespace
Decomp:
.return
    JML $00B8EA
.start
    %ReadByte()
    CMP.b #$FF
    BEQ .return
    STA $8F
    AND.b #$E0
    CMP.b #$E0
    BEQ +
    PHA
    LDA $8F
    REP #$20
    AND.w #$001F
    BRA .label2
+   LDA $8F
    ASL
    ASL
    ASL
    AND.b #$E0
    PHA
    LDA $8F
    AND.b #$03
    XBA
    %ReadByte()
    REP #$20
.label2
    INC A
    STA $8D
    SEP #$20
    PLA
    BEQ .label3
    BPL .nextup
.label4
    %ReadByte()
    XBA
    %ReadByte()
    TAX
-   PHY
    TXY
    LDA [$00],Y
    PLY
    STA [$00],Y
    INY
    INX
    REP #$20
    DEC $8D
    SEP #$20
    BNE -
    JMP.w .start
.nextup
    ASL
    BPL .label5
    ASL
    BPL .label6
    %ReadByte()
    LDX $8D
-   STA [$00],Y
    INC A
    INY
    DEX
    BNE -
    JMP.w .start
.label3
    %ReadByte()
    STA [$00],Y
    INY
    LDX $8D
    DEX
    STX $8D
    BNE .label3
    JMP .start
.label5
    %ReadByte()
    LDX $8D
-   STA [$00],Y
    INY
    DEX
    BNE -
    JMP .start
.label6
    %ReadByte()
    XBA
    %ReadByte()
    LDX $8D
-   XBA
    STA [$00],Y
    INY
    DEX
    BEQ +
    XBA
    STA [$00],Y
    INY
    DEX
    BNE -
+   JMP .start
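
For anyone who wants to check an optimized routine against known-good output, here is a minimal Python sketch of the command format as read off the ASM above. It is only a model: the function name `lc_lz2_decompress` is made up for illustration, and the command meanings (direct copy, byte fill, word fill, increasing fill, repeat) are inferred from the branches in the routine, not taken from any official spec.

```python
def lc_lz2_decompress(data):
    """Hypothetical reference model of the command loop in the ASM routine."""
    out = bytearray()
    pos = 0
    while True:
        header = data[pos]
        pos += 1
        if header == 0xFF:                 # end-of-data marker
            return bytes(out)
        command = header >> 5              # top three bits select the command
        length = (header & 0x1F) + 1       # low five bits hold length - 1
        if command == 7:                   # long form: %111CCCLL LLLLLLLL
            command = (header >> 2) & 0x07
            length = (((header & 0x03) << 8) | data[pos]) + 1
            pos += 1
        if command == 0:                   # direct copy from the input
            out += data[pos:pos + length]
            pos += length
        elif command == 1:                 # byte fill
            out += bytes([data[pos]]) * length
            pos += 1
        elif command == 2:                 # word fill: alternate two bytes
            pair = data[pos:pos + 2]
            pos += 2
            out += bytes(pair[i & 1] for i in range(length))
        elif command == 3:                 # increasing fill
            start = data[pos]
            pos += 1
            out += bytes((start + i) & 0xFF for i in range(length))
        else:                              # bit 7 set: repeat earlier output
            src = (data[pos] << 8) | data[pos + 1]   # high byte comes first
            pos += 2
            for i in range(length):        # byte by byte, so overlap works
                out.append(out[src + i])


sample = bytes([0x02, 1, 2, 3, 0x22, 0xAA, 0x43, 0x10, 0x20, 0x62, 0x05, 0xFF])
print(lc_lz2_decompress(sample).hex(" "))
```

Running the same compressed file through a sketch like this and through the ROM routine (dumping RAM in an emulator) gives a quick way to spot regressions after each optimization pass.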


I'm pretty sure that this can be optimized even more. But seeing that I hit my limit, I have no idea how. I'd prefer advanced ASM hackers to contribute to this optimization. But other people can try too.

By optimizing I mean faster code, even if it costs ROM space: the code should use as few cycles as possible. It's all for the sake of decreasing level loading times and whatnot. It would be awesome if we actually saw visibly faster loading times.

You can find a list of cycles in this document.
Last edited on 2010-05-24 05:34:45 AM by Ersanio.
I'm sorry. I can't speak English.
So I put only a code.
Really I'm sorry.

Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
    JML Decomp_Start
macro ReadByte()
    LDA [$8A]
    LDX $8A
    INX
    BNE +
    LDX.w #$8000
    INC $8C
+   STX $8A
endmacro
org !Freespace
Decomp:
.Return
    PLB
    JML $80B8EA
.Start
    PHB
    LDA $02
    PHA
    PLB
.Loop
    %ReadByte()
    CMP #$FF
    BEQ .Return
    STA $8F
    AND #$E0
    CMP #$E0
    BEQ +
    PHA
    LDA $8F
    REP #$20
    AND.w #$001F
    BRA .Label2
+   LDA $8F
    ASL #3
    AND #$E0
    PHA
    LDA $8F
    AND #$03
    XBA
    %ReadByte()
    REP #$20
.Label2
    INC A
    STA $8D
    SEP #$20
    PLA
    BEQ .Label3
    BPL .NextUp
.Label4
    %ReadByte()
    XBA
    %ReadByte()
    TAX
    REP #$20
    LSR $8D
    LDA $8D
    BEQ .LoopEnd0
-   PHY
    TXY
    LDA ($00),y
    PLY
    STA ($00),y
    INY #2
    INX #2
    DEC $8D
    BNE -
.LoopEnd0
    SEP #$20
    BCS +
    JMP .Loop
+   PHY
    TXY
    LDA ($00),y
    PLY
    STA ($00),y
    INY
    JMP .Loop
.NextUp
    ASL A
    BPL .Label5
    ASL A
    BPL .Label6
    %ReadByte()
    LDX $8D
-   STA ($00),y
    INC A
    INY
    DEX
    BNE -
    JMP .Loop
.Label3
    %ReadByte()
    STA ($00),y
    INY
    LDX $8D
    DEX
    STX $8D
    BNE .Label3
    JMP .Loop
.Label5
    %ReadByte()
    LDX $8D
-   STA ($00),y
    INY
    DEX
    BNE -
    JMP .Loop
.Label6
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    REP #$20
    LSR $8D
    LDX $8D
    BEQ .LoopEnd
-   STA ($00),y
    INY #2
    DEX
    BNE -
.LoopEnd
    SEP #$20
    BCS +
    JMP .Loop
+   STA ($00),y
    INY
    JMP .Loop
Last edited on 2010-05-24 11:24:40 AM by 33953YoShI.
I just added in a little piece of code that stores the Y value (which contains the size of the decompressed ExGFX file) to $8D/$8E before the high byte gets destroyed by the SEP #$10 at $00B8EA.

This is useful for patches that need to upload arbitrary-sized ExGFX files without having to include a full copy of this routine in their code just to get the decompressed size.

$8D/$8E is already overwritten many times in the decompression routine and doesn't contain any useful information afterwards, so it's a good address to use.

EDIT: I'm also still leaning towards coding a GZIP decompression routine for SMW and implementing it into LM - not necessarily for faster decompression speeds, but because GZIP can compress to decently smaller sizes than LC_LZ2 in some cases.

GZIP is a pretty standardized compression format, and I've already found quite a few documents on it; I just haven't gotten around to coding a 65c816 version of the decompressor.

Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
    JML Decomp_Start
macro ReadByte()
    LDA [$8A]
    LDX $8A
    INX
    BNE +
    LDX.w #$8000
    INC $8C
+   STX $8A
endmacro
org !Freespace
Decomp:
.Return
    PLB
    STY $8D ; store size to $8D
    JML $80B8EA
.Start
    PHB
    LDA $02
    PHA
    PLB
.Loop
    %ReadByte()
    CMP #$FF
    BEQ .Return
    STA $8F
    AND #$E0
    CMP #$E0
    BEQ +
    PHA
    LDA $8F
    REP #$20
    AND.w #$001F
    BRA .Label2
+   LDA $8F
    ASL #3
    AND #$E0
    PHA
    LDA $8F
    AND #$03
    XBA
    %ReadByte()
    REP #$20
.Label2
    INC A
    STA $8D
    SEP #$20
    PLA
    BEQ .Label3
    BPL .NextUp
.Label4
    %ReadByte()
    XBA
    %ReadByte()
    TAX
    REP #$20
    LSR $8D
    LDA $8D
    BEQ .LoopEnd0
-   PHY
    TXY
    LDA ($00),y
    PLY
    STA ($00),y
    INY #2
    INX #2
    DEC $8D
    BNE -
.LoopEnd0
    SEP #$20
    BCS +
    JMP .Loop
+   PHY
    TXY
    LDA ($00),y
    PLY
    STA ($00),y
    INY
    JMP .Loop
.NextUp
    ASL A
    BPL .Label5
    ASL A
    BPL .Label6
    %ReadByte()
    LDX $8D
-   STA ($00),y
    INC A
    INY
    DEX
    BNE -
    JMP .Loop
.Label3
    %ReadByte()
    STA ($00),y
    INY
    LDX $8D
    DEX
    STX $8D
    BNE .Label3
    JMP .Loop
.Label5
    %ReadByte()
    LDX $8D
-   STA ($00),y
    INY
    DEX
    BNE -
    JMP .Loop
.Label6
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    REP #$20
    LSR $8D
    LDX $8D
    BEQ .LoopEnd
-   STA ($00),y
    INY #2
    DEX
    BNE -
.LoopEnd
    SEP #$20
    BCS +
    JMP .Loop
+   STA ($00),y
    INY
    JMP .Loop
Last edited on 2010-05-24 11:46:39 AM by edit1754.
Got rid of the indirect stuff because this is faster, plus you can use X, so there's no expensive shuffling in and out of Y. Your size thing should be preserved too. Edit:

Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
    JML Decomp_Start
macro ReadByte()
    LDA [$8A]
    LDX $8A
    INX
    BNE +
    LDX.w #$8000
    INC $8C
+   STX $8A
endmacro
org !Freespace
Decomp:
.Return
    PLB
    REP #$20
    TYA
    SEC
    SBC $00 ;sub starting pointer
    STA $8D ; store size to $8D
    SEP #$20
    JML $80B8EA
.Start
    PHB
    LDA $02
    PHA
    PLB
    LDY $00 ;16bit pointer in Y
.Loop
    %ReadByte()
    CMP #$FF
    BEQ .Return
    STA $8F
    AND #$E0
    CMP #$E0
    BEQ +
    PHA
    LDA $8F
    REP #$20
    AND.w #$001F
    BRA .Label2
+   LDA $8F
    ASL #3
    AND #$E0
    PHA
    LDA $8F
    AND #$03
    XBA
    %ReadByte()
    REP #$20
.Label2
    INC A
    STA $8D
    SEP #$20
    PLA
    BEQ .Label3
    BPL .NextUp
.Label4
    %ReadByte()
    XBA
    %ReadByte()
    REP #$21
    ADC $00 ;X needs to be offset by original pointer
    TAX
    LSR $8D
    LDA $8D
    BEQ .LoopEnd0
-   LDA $0000,x
    STA $0000,y
    INY #2
    INX #2
    DEC $8D
    BNE -
.LoopEnd0
    SEP #$20
    BCS +
    JMP .Loop
+   LDA $0000,x
    STA $0000,y
    INY
    JMP .Loop
.NextUp
    ASL A
    BPL .Label5
    ASL A
    BPL .Label6
    %ReadByte()
    LDX $8D
-   STA $0000,y
    INC A
    INY
    DEX
    BNE -
    JMP .Loop
.Label3
    %ReadByte()
    STA $0000,y
    INY
    LDX $8D
    DEX
    STX $8D
    BNE .Label3
    JMP .Loop
.Label5
    %ReadByte()
    LDX $8D
-   STA $0000,y
    INY
    DEX
    BNE -
    JMP .Loop
.Label6
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    REP #$20
    LSR $8D
    LDX $8D
    BEQ .LoopEnd
-   STA $0000,y
    INY #2
    DEX
    BNE -
.LoopEnd
    SEP #$20
    BCS +
    JMP .Loop
+   STA $0000,y
    INY
    JMP .Loop


will contribute more when i'm not so tired, pardon any stupid mistakes but they should be easy to spot and correct.
Last edited on 2010-05-24 12:08:02 PM by smkdan.
I optimized it a little.
If there is a bug, I am sorry.

Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
    JML Decomp_Start
macro ReadByte()
    LDA [$00],y
    INY
    BMI +
    LDY.w #$8000
    INC $02
+
endmacro
macro ReadWord()
    LDA [$00],y
    INY #2
    BMI +
    PHA
    TYA
    ORA #$8000
    TAY
    SEP #$20
    INC $02
    REP #$20
    PLA
+
endmacro
org !Freespace
Decomp:
.Return
    PLY
    STY $00
    LDA $02
    STA $8C
    STA $8F
    PHB
    PLA
    STA $02
    PLB
    REP #$20
    TXA
    SEC
    SBC $00 ;sub starting pointer
    TXY
    STA $8D ; store size to $8D
    SEP #$20
    JML $80B8EA
.Start
    PHB
    LDA $02
    PHA
    PLB
    LDX $00 ;16bit pointer in X
    PHX
    STZ $00
    STZ $01
    LDY $8A
    LDA $8C
    STA $02
.Loop
    LDA $7F8182
    %ReadByte()
    CMP #$FF
    BEQ .Return
    STA $8F
    AND #$E0
    CMP #$E0
    BEQ +
    PHA
    LDA $8F
    REP #$20
    AND.w #$001F
    BRA .Label2
+   LDA $8F
    ASL #3
    AND #$E0
    PHA
    LDA $8F
    AND #$03
    XBA
    %ReadByte()
    REP #$20
.Label2
    INC A
    STA $8D
    SEP #$20
    PLA
    BEQ .Label3
    BPL .NextUp
.Label4
    %ReadByte()
    XBA
    %ReadByte()
    PHY
    REP #$21
    ADC $03,s ;Y needs to be offset by original pointer
    TAY
    LSR $8D
    LDA $8D
    BEQ .LoopEnd0
-   LDA $0000,y
    STA $0000,x
    INY #2
    INX #2
    DEC $8D
    BNE -
.LoopEnd0
    SEP #$20
    BCS +
    PLY
    JMP .Loop
+   LDA $0000,y
    STA $0000,x
    INX
    PLY
    JMP .Loop
.NextUp
    ASL A
    BPL .Label5
    ASL A
    BPL .Label6
    %ReadByte()
    PHY
    LDY $8D
-   STA $0000,x
    INC A
    INX
    DEY
    BNE -
    PLY
    JMP .Loop
.Label3
    REP #$20
    LSR $8D
    LDA $8D
    BEQ .LoopEnd1
-   %ReadWord()
    STA $0000,x
    INX #2
    DEC $8D
    BNE -
.LoopEnd1
    SEP #$20
    BCS +
    JMP .Loop
+   %ReadByte()
    STA $0000,x
    INX
    JMP .Loop
.Label5
    %ReadByte()
    PHY
    LDY $8D
-   STA $0000,x
    INX
    DEY
    BNE -
    PLY
    JMP .Loop
.Label6
    REP #$20
    %ReadWord()
    LSR $8D
    PHY
    LDY $8D
    BEQ .LoopEnd
-   STA $0000,x
    INX #2
    DEY
    BNE -
.LoopEnd
    PLY
    SEP #$20
    BCS +
    JMP .Loop
+   STA $0000,x
    INX
    JMP .Loop
Nice work guys! I've done a small extremely inaccurate calculation and it seems like the level loading time improved by 0.5 seconds.

I'm sure this can be optimized even more though so I'll probably find a way again.
Half a second is good. I may use this for the SMAS compression routine, seeing as I used the original SMB1 code anyway. =P
Over a million cycles saved is good, and it hasn't even been unrolled yet. There's still room for improvement; the thread is still pretty new.

Just a small bug I noticed (but only happens on bank crossing):

Code
macro ReadWord()
    LDA [$00],y
    INY #2
    BMI +
    PHA
    TYA
    ORA #$8000
    TAY
    SEP #$20
    INC $02
    REP #$20
    PLA
+
endmacro


If [$00] points to $12:FFFF, for example, it will read the upper byte from $13:0000 instead of $13:8000.
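
To make the bank-crossing case concrete, here is a small Python sketch of how a 16-bit read has to step through LoROM data. The helper names (`next_lorom_address`, `read_word`) and the dictionary standing in for the ROM are made up for illustration; the only thing taken from the thread is that the byte after $xx:FFFF lives at $(xx+1):8000.

```python
def next_lorom_address(bank, offset):
    """Advance a LoROM pointer by one byte; after $FFFF continue at $8000 of the next bank."""
    offset += 1
    if offset > 0xFFFF:
        return (bank + 1) & 0xFF, 0x8000
    return bank, offset


def read_word(rom, bank, offset):
    """Read a little-endian word one byte at a time so a bank crossing is handled.

    `rom` is a hypothetical mapping of (bank, offset) -> byte value.
    Returns (word, bank, offset) with the pointer advanced past the word.
    """
    low = rom[(bank, offset)]
    bank, offset = next_lorom_address(bank, offset)
    high = rom[(bank, offset)]
    bank, offset = next_lorom_address(bank, offset)
    return low | (high << 8), bank, offset


# Low byte sits at $12:FFFF, so the high byte must come from $13:8000.
rom = {(0x12, 0xFFFF): 0x34, (0x13, 0x8000): 0x12, (0x13, 0x0000): 0x99}
word, bank, offset = read_word(rom, 0x12, 0xFFFF)
print(hex(word), hex(bank), hex(offset))
```

A plain 16-bit `LDA [$00],y` at that address would pick up the $99 from $13:0000 instead, which is the bug described here.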

Code
macro ReadWord()
    LDA [$00],y
    INY #2
    BMI +
    PHP
    LDY #$8000
    SEP #$20 ;is $03 used for anything? can save a cycle without the SEP
    INC $02
    PLP
    BEQ +
    XBA ;it's the high byte that got affected
    SEP #$20
    LDA [$00],Y
    XBA
    INY
    REP #$20
+
endmacro


didn't test that so point out anything that doesn't look quite right, of course =)

edit: since push/pull is actually slower than STA dp/LDA dp, some savings can be made by placing them in unused DP space for that routine.
Last edited on 2010-05-25 09:24:22 PM by smkdan.
Going through the current code carefully, I've noticed that the following pieces of code can be unrolled:

Code
-   LDA $0000,y ;direct copy?
    STA $0000,x
    INY #2
    INX #2
    DEC $8D
    BNE -


Code
    LDY $8D
-   STA $0000,x ;direct fill?
    INX
    DEY
    BNE -


Code
-   STA $0000,x ;direct word fill?
    INX #2
    DEY
    BNE -


I'd try to do this myself but seeing that I have school today...............
MVN test.

Code
header
lorom
!ofs = $8FF000
macro ReadByte()
    STX $8A
    LDA [$8A]
    INX
    BNE $03
    JSR BANK_INC
endmacro
org $80B8E3
    JSL !ofs
    RTS
org !ofs
    PHB
    PEI ($03)
    PEI ($05)
    PEI ($07)
    PEI ($09)
    PEI ($0B)
    PEI ($8A)
    SEP #$20
    REP #$10
    LDA $02
    PHA
    PLB
    STA $05 ; dest_bank
    INC
    STA $03 ; dest_bank [plus or minus]
    LDA #$54
    STA $04 ; mvn
    LDA #$4C
    STA $07 ; jump
    LDA $8C
    STA $06 ; src_bank
    LDX.w #.back
    STX $08
    LDY $00 ; dest_low
    LDX $8A ; src_low
    STZ $8A
    STZ $8B
    BRA .main
.case_80_or_e0
    BPL .lz
    LDA $8D
    CMP #$1F
    BNE .case_e0
    PLX : STX $8A
    PLX : STX $0B
    PLX : STX $09
    PLX : STX $07
    PLX : STX $05
    PLX : STX $03
    SEP #$10
    PLB
    RTL
.case_e0
    AND #$03
    STA $8E
    EOR $8D
    ASL
    ASL
    ASL
    XBA
    %ReadByte()
    STA $8D
    XBA
    BRA .type
.lz
    %ReadByte()
    XBA
    %ReadByte()
    STX $0B
    REP #$21
    ADC $00
    TAX
    LDA $8D
    SEP #$20
    BIT $03
    BPL +
    MVN $7F7F
    BRA ++
+   MVN $7E7E
++  LDX $0B
.main
    %ReadByte()
    STA $8D
    STZ $8E
    AND #$E0
    TRB $8D
.type
    ASL
    BCS .case_80_or_e0
    BMI .case_40_or_60
    ASL
    BMI .case_20
.case_00
    REP #$20
    LDA $8D
    STX $8D
-   SEP #$20
    JMP $0004
.back
    CPX $8D
    BCS .main
    JSR BANK_INC_2
    CPX #$0000
    BEQ ++
    DEX
    STX $0B
    REP #$21
    LDX #$8000
    STX $8D
    TYA
    SBC $0B
    TAY
    LDA $0B
    BRA -
++  LDX #$8000
    BRA .main
.case_20
    %ReadByte()
    STX $0B
    PHA
    PHA
    REP #$20
.case_20_main
    LDA $8D
    INC
    LSR
    TAX
    PLA
-   STA $0000,Y
    INY
    INY
    DEX
    BNE -
    SEP #$20
    BCC +
    STA $0000,Y
    INY
+   LDX $0B
    BRA .main
.case_40_or_60
    ASL
    BMI .case_60
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    STX $0B
    REP #$20
    PHA
    BRA .case_20_main
.case_60
    %ReadByte()
    STX $0B
    LDX $8D
-   STA $0000,Y
    INC
    INY
    DEX
    BPL -
    LDX $0B
    JMP .main
BANK_INC:
    LDX #$8000
BANK_INC_2:
    INC $06
    INC $8C
    RTS
@ersanio: the unrollable stuff will be good to convert to DMA, except word fill, which would be pretty awkward. The copy/fill (byte) loops are suitable though.

@Min: Self-modifying code with MVN is better than load/store, but DMA WRAM->SRAM->WRAM is 2 cycles/byte, not counting setup. A few days ago there was a chat about just using DMA (it would also work for byte copy).

@ersanio/others: For byte fill you can just use DMA and set it to not increment the address, so it reads the same byte every time. That's 1 cycle per byte transferred to $2180. You just have to store the byte somewhere in SRAM, because WRAM->WRAM transfers are not allowed.

Copy can do $2180->$70xxxx->$2180, but some SRAM must be reserved. Much faster, but more awkward to use. MVN is easy, so it depends on what we decide. If we ever have to copy like 200 bytes or something (large monocolored area, repeating tile sequence, etc.) then DMA will be massively faster than MVN. For small quantities, MVN is OK due to DMA setup time.

By the end when everyone has put in their contribution it should be much, much faster than the original.

@Japan guys: if there is anyone else to bring please bring them since the communities are kind of divided for whatever reason. Even if the English is not too good it is still very easy to contribute to projects like these.
Last edited on 2010-05-26 08:19:45 AM by smkdan.
Re-added the size to $8D thing. I had to move one block of code to prevent it from causing out-of-range branches.
It will be very helpful if this stays in there

also added RATS tag, and insert-size counter (current size is 351 bytes)

Code
header
lorom
!ofs = $8FF000
macro ReadByte()
    STX $8A
    LDA [$8A]
    INX
    BNE $03
    JSR BANK_INC
endmacro
org $80B8E3
    JSL CodeStart
    RTS
org !ofs
reset bytes
    db "STAR"
    dw CodeEnd-CodeStart-$01
    dw CodeEnd-CodeStart-$01^$FFFF
CodeStart:
    PHB
    PEI ($03)
    PEI ($05)
    PEI ($07)
    PEI ($09)
    PEI ($0B)
    PEI ($8A)
    SEP #$20
    REP #$10
    LDA $02
    PHA
    PLB
    STA $05 ; dest_bank
    INC
    STA $03 ; dest_bank [plus or minus]
    LDA #$54
    STA $04 ; mvn
    LDA #$4C
    STA $07 ; jump
    LDA $8C
    STA $06 ; src_bank
    LDX.w #.back
    STX $08
    LDY $00 ; dest_low
    LDX $8A ; src_low
    STZ $8A
    STZ $8B
    BRA .main
.case_e0
    AND #$03
    STA $8E
    EOR $8D
    ASL
    ASL
    ASL
    XBA
    %ReadByte()
    STA $8D
    XBA
    BRA .type
.case_80_or_e0
    BPL .lz
    LDA $8D
    CMP #$1F
    BNE .case_e0
    PLX : STX $8A
    PLX : STX $0B
    PLX : STX $09
    PLX : STX $07
    PLX : STX $05
    PLX : STX $03
    REP #$20
    TYA
    SEC
    SBC $00
    STA $8D ; size!!!
    SEP #$30
    PLB
    RTL
.lz
    %ReadByte()
    XBA
    %ReadByte()
    STX $0B
    REP #$21
    ADC $00
    TAX
    LDA $8D
    SEP #$20
    BIT $03
    BPL +
    MVN $7F7F
    BRA ++
+   MVN $7E7E
++  LDX $0B
.main
    %ReadByte()
    STA $8D
    STZ $8E
    AND #$E0
    TRB $8D
.type
    ASL
    BCS .case_80_or_e0
    BMI .case_40_or_60
    ASL
    BMI .case_20
.case_00
    REP #$20
    LDA $8D
    STX $8D
-   SEP #$20
    JMP $0004
.back
    CPX $8D
    BCS .main
    JSR BANK_INC_2
    CPX #$0000
    BEQ ++
    DEX
    STX $0B
    REP #$21
    LDX #$8000
    STX $8D
    TYA
    SBC $0B
    TAY
    LDA $0B
    BRA -
++  LDX #$8000
    BRA .main
.case_20
    %ReadByte()
    STX $0B
    PHA
    PHA
    REP #$20
.case_20_main
    LDA $8D
    INC
    LSR
    TAX
    PLA
-   STA $0000,Y
    INY
    INY
    DEX
    BNE -
    SEP #$20
    BCC +
    STA $0000,Y
    INY
+   LDX $0B
    BRA .main
.case_40_or_60
    ASL
    BMI .case_60
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    STX $0B
    REP #$20
    PHA
    BRA .case_20_main
.case_60
    %ReadByte()
    STX $0B
    LDX $8D
-   STA $0000,Y
    INC
    INY
    DEX
    BPL -
    LDX $0B
    JMP .main
BANK_INC:
    LDX #$8000
BANK_INC_2:
    INC $06
    INC $8C
    RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
Last edited on 2010-05-26 12:59:44 PM by edit1754.
Eliminated unnecessary JSRs. The code is about 0.7 seconds faster now o_o
Code
HEADER
LOROM
!Freespace = $1D8000|$800000
macro ReadByte()
    STX $8A
    LDA [$8A]
    INX
    BNE +
    LDX #$8000
    INC $06
    INC $8C
+
endmacro
org $80B8E3
    JSL CodeStart ;was JML before
    RTS
org !Freespace
reset bytes
    db "STAR"
    dw CodeEnd-CodeStart-$01
    dw CodeEnd-CodeStart-$01^$FFFF
CodeStart:
    PHB
    PEI ($03)
    PEI ($05)
    PEI ($07)
    PEI ($09)
    PEI ($0B)
    PEI ($8A)
    SEP #$20
    REP #$10
    LDA $02
    PHA
    PLB
    STA $05 ; dest_bank
    INC
    STA $03 ; dest_bank [plus or minus]
    LDA #$54
    STA $04 ; mvn
    LDA #$4C
    STA $07 ; jump
    LDA $8C
    STA $06 ; src_bank
    LDX.w #.back
    STX $08
    LDY $00 ; dest_low
    LDX $8A ; src_low
    STZ $8A
    STZ $8B
    BRA .main
.case_e0
    AND #$03
    STA $8E
    EOR $8D
    ASL
    ASL
    ASL
    XBA
    %ReadByte()
    STA $8D
    XBA
    BRA .type
.case_80_or_e0
    BPL .lz
    LDA $8D
    CMP #$1F
    BNE .case_e0
    PLX : STX $8A
    PLX : STX $0B
    PLX : STX $09
    PLX : STX $07
    PLX : STX $05
    PLX : STX $03
    REP #$20
    TYA
    SEC
    SBC $00
    STA $8D ; size!!!
    SEP #$30
    PLB
    RTL ;JML $80B8EA
.lz
    %ReadByte()
    XBA
    %ReadByte()
    STX $0B
    REP #$21
    ADC $00
    TAX
    LDA $8D
    SEP #$20
    BIT $03
    BPL +
    MVN $7F7F
    BRA ++
+   MVN $7E7E
++  LDX $0B
.main
    %ReadByte()
    STA $8D
    STZ $8E
    AND #$E0
    TRB $8D
.type
    ASL
    BCS .case_80_or_e0
    BMI .case_40_or_60
    ASL
    BMI .case_20
.case_00
    REP #$20
    LDA $8D
    STX $8D
-   SEP #$20
    JMP $0004
.back
    CPX $8D
    BCS .main
    INC $06
    INC $8C
    CPX #$0000
    BEQ ++
    DEX
    STX $0B
    REP #$21
    LDX #$8000
    STX $8D
    TYA
    SBC $0B
    TAY
    LDA $0B
    BRA -
++  LDX #$8000
    BRA .main
.case_20
    %ReadByte()
    STX $0B
    PHA
    PHA
    REP #$20
.case_20_main
    LDA $8D
    INC
    LSR
    TAX
    PLA
-   STA $0000,Y
    INY
    INY
    DEX
    BNE -
    SEP #$20
    BCC +
    STA $0000,Y
    INY
+   LDX $0B
    BRA .main
.case_40_or_60
    ASL
    BMI .case_60
    %ReadByte()
    XBA
    %ReadByte()
    XBA
    STX $0B
    REP #$20
    PHA
    BRA .case_20_main
.case_60
    %ReadByte()
    STX $0B
    LDX $8D
-   STA $0000,Y
    INC
    INY
    DEX
    BPL -
    LDX $0B
    JMP .main
CodeEnd:
print "Insert Size: ",bytes," bytes"
Last edited on 2010-05-27 03:16:13 AM by Ersanio.
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL. I hate to keep going back to SMAS, but I do intend on using this routine in more than one spot. Point is, it can happen in SMW too, so I'd say the beneficiary comfort of keeping an RTL for convenience outweighs the additional microsecond we save in time.

Just my opinion. Anyone can rebut.
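
For a rough feel of the numbers, here is a back-of-the-envelope Python sketch. Everything in it is an assumption for illustration: it only compares JSL+RTL (8 + 6 cycles) against two JMLs (4 cycles each), treats every CPU cycle as a slow 8-master-clock cycle, uses the NTSC master clock, and takes the "20 calls per level" figure from the post above. The function name `extra_seconds` is made up.

```python
MASTER_CLOCK_HZ = 21_477_272     # NTSC SNES master clock (assumed)
SLOW_CYCLE_MASTER_CLOCKS = 8     # assumed worst case: every CPU cycle is a slow one
JSL_CYCLES, RTL_CYCLES, JML_CYCLES = 8, 6, 4


def extra_seconds(calls):
    """Extra time spent using JSL/RTL instead of JML in and JML out."""
    extra_cycles = (JSL_CYCLES + RTL_CYCLES) - 2 * JML_CYCLES
    return calls * extra_cycles * SLOW_CYCLE_MASTER_CLOCKS / MASTER_CLOCK_HZ


overhead = extra_seconds(20)
print(f"{overhead * 1e6:.1f} microseconds per level load")
```

Under these assumptions the cost comes out in the tens of microseconds per level load, a tiny fraction of a single frame, so the exact figure doesn't change the argument.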
I would just have the JML as default and add comments for JSL conversion of the hijack.
Originally posted by spel werdz rite
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL.


Agreeing with this. It's literally impossible to notice the difference and just causes inconvenience, so there's little reason to keep it. The focus should be on optimizing the loop content.
@smkdan and SWR: you both have good points actually.

I'll change them to JSL/RTL again.
@smkdan: Thanks.

I rewrote the code.

Code
lorom
header
!Freespace = $1D8000|$800000
macro ReadByte()
    STX $8A
    LDA [$8A]
    INX
    BNE +
    JSR BANK_INC
+
endmacro
macro ReadWord()
    STX $8A
    LDA [$8A]
    INX
    INX
    BMI +
    JSR BANK_INC_2
+
endmacro
org $80B8E3
    JSL CodeStart
    RTS
org !Freespace
reset bytes
    db "STAR"
    dw CodeEnd-CodeStart-$0001
    dw CodeEnd-CodeStart-$0001^$FFFF
CodeStart:
    PHB
    PEI ($03)
    PEI ($05)
    PEI ($07)
    PEI ($09)
    PEI ($0B)
    PEI ($8A)
    SEP #$20
    REP #$10
    LDA $02
    PHA
    PLB
    STA $05 ; dest_bank
    INC
    STA $03 ; dest_bank [plus or minus]
    LDA #$54
    STA $04 ; mvn
    LDA #$4C
    STA $07 ; jump
    LDA $8C
    STA $06 ; src_bank
    LDX.w #.back
    STX $08
    LDY $00 ; dest_low
    LDX $8A ; src_low
    STZ $8A
    STZ $8B
    JMP .main
.case_ff
    PLX : STX $8A
    PLX : STX $0B
    PLX : STX $09
    PLX : STX $07
    PLX : STX $05
    PLX : STX $03
    REP #$20
    TYA
    SBC $00 ; carry = 1
    STA $8D ; size
    SEP #$30
    PLB
    RTL
.case_e0
    LDA $8D
    CMP #$1F
    BEQ .case_ff
    AND #$03
    STA $8E
    EOR $8D
    ASL
    ASL
    ASL
    XBA
    %ReadByte()
    STA $8D
    XBA
    BRA .type
.case_00
    LDA $8E
    XBA
    LDA $8D
-   JMP $0004
.back
    CPX #$0000
    BMI .main
    INC $06
    INC $8C
    DEX
    BMI ++
    STX $0B
    REP #$21
    LDX #$8000
    TYA
    SBC $0B
    TAY
    LDA $0B
    SEP #$20
    BRA -
++  LDX #$8000
    BRA .main
.case_80_or_e0
    BMI .case_e0
    REP #$21
    %ReadWord()
    XBA
    STX $8A
    ADC $00
    TAX
    LDA $8D
    SEP #$20
    BIT $03
    BPL +
    MVN $7F7F
    BRA ++
+   MVN $7E7E
++  LDX $8A
    BRA .bra
.main
    STX $8A
.bra
    LDA [$8A]
    INX
    BNE +
    JSR BANK_INC
+   STA $8D
    STZ $8E
    AND #$E0
    TRB $8D
.type
    ASL
    BCS .case_80_or_e0
    BEQ .case_00
    BMI .case_40_or_60
.case_20
    %ReadByte()
    STX $0B
    PHA
    PHA
    REP #$20
.case_20_main
    LDA $8D
    INC
    LSR
    TAX
    PLA
-   STA $0000,Y
    INY
    INY
    DEX
    BNE -
    SEP #$20
    LDX $0B
    BCC .main
    STA $0000,Y
    INY
    BRA .main
.case_40_or_60
    ASL
    BMI .case_60
    REP #$20
    %ReadWord()
    STX $0B
    PHA
    BRA .case_20_main
.case_60
    %ReadByte()
    STX $0B
    LDX $8D
-   STA $0000,Y
    INC
    INY
    DEX
    BPL -
    LDX $0B
    JMP .main
BANK_INC:
    LDX #$8000
    INC $06
    INC $8C
    RTS
BANK_INC_2:
    CPX #$0001
    LDX #$8000
    INC $06 ; $07($8D) is not affected.
    INC $8C
    BCC +
    SEP #$20
    XBA
    STX $8A
    LDA [$8A]
    XBA
    INX
    REP #$21
+   RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"


DMA version (does not use SRAM).
// updated

Code
lorom
header
!Freespace = $1D8000|$800000
; Note:
; - This routine uses $00:211B-$00:211C,
;   $00:2134-$00:2136, $00:435x-$00:436x.
;
; - This routine doesn't use WRAM->SRAM->WRAM DMA,
;   because it will fail in some cases.
;
;   Error case:
;     CompData: <02:01 02 03> <85:00 00> <FF>
;     dest: $7F0000
;
;     <02:01 02 03>
;       01 02 03 .. .. .. .. .. ..
;       01 02 03 .. .. .. .. .. ..
;
;     <85:00 00>
;       01 02 03 01 02 03 01 02 03 ; Default / MVN
;       01 02 03 01 02 03 .. .. .. ; (for example)
;         WRAM->SRAM ($7F:0000-$7F:0005 -> $70:0000-$70:0005)
;         SRAM->WRAM ($70:0000-$70:0005 -> $7F:0003-$7F:0008)
;
; - It is needed to change the value of D register
;   from $4300 to $0000 at the beginning of NMI/IRQ.
;
macro ReadByte()
    LDA [$67],Y
    INY
    BNE +
    JSR BANK_INC
+
endmacro
macro ReadWord()
    LDA [$67],Y
    INY
    INY
    BMI +
    JSR BANK_INC_2
+
endmacro
org $80B8E3
    JSL CodeStart
    RTS
org !Freespace
reset bytes
    db "STAR"
    dw CodeEnd-CodeStart-$0001
    dw CodeEnd-CodeStart-$0001^$FFFF
CodeStart:
    PHB
    PHK
    PLB
    PHD
    PEA $4300
    PLD
    LDX #$3480
    STX $50 ; dma_param
    LDX $0000
    STX $58 ; dest_low[start] / (HDMA Table Address)
    LDA $0002
    STA $54 ; dest_bank
    STA $2183
    INC
    STA $57 ; [Plus or Minus] / (HDMA Indirect Bank)
    LDY #$8000
    STY $60 ; dma_param
    LDY $008A
    STZ $67 ; src_low
    STZ $68 ; src_high
    LDA $008C
    STA $64 ; src_bank
    STA $69 ; src_bank
    JMP .main
.end
    PLD
    REP #$20
    TXA
    SBC $00 ; carry = 1
    STA $8D ; size
    SEP #$30
    PLB
    RTL
.case_e0
    LDA $65
    CMP #$1F
    BEQ .end
    AND #$03
    STA $66
    EOR $65
    ASL #3
    XBA
    %ReadByte()
    STA $65
    XBA
    BRA .type
.case_00
    REP #$21
    INC $65 ; bytecount
    STX $2181
    TXA
    ADC $65
    TAX
    STY $62
-   SEP #$20
    LDA #$40
    STA $420B
    LDY $62
    BMI .main
    INC $64
    INC $69
    CPY #$0000
    BEQ ++
    REP #$20
    STY $65
    LDY #$8000
    STY $62
    TXA
    SBC $65 ; carry = 1
    STA $2181
    BRA -
++  LDY #$8000
    BRA .main
.case_80_or_e0
    BMI .case_e0
    REP #$21
    %ReadWord()
    STY $52 ; tmp
    TXY
    XBA
    ADC $58
    TAX
    LDA $65
    SEP #$20
    PHB
    BIT $57
    BPL +
    MVN $7F7F
    BRA ++
+   MVN $7E7E
++  TYX
    LDY $52
    PLB
.main
    %ReadByte()
    STA $65
    STZ $66
    AND #$E0
    TRB $65
.type
    ASL
    BCS .case_80_or_e0
    BEQ .case_00
    BMI .case_40_or_60
.case_20
    %ReadByte()
    STA $211B
    STZ $211B
    LDA #$80
    STA $50 ; param
    STX $52
    LDX $65
    INX
    STX $55
    LDA #$01
    STA $211C
    LDA #$20
    STA $420B
    LDX $52
    BRA .main
.case_40_or_60
    ASL
    BMI .case_60
.case_40
    REP #$20
    %ReadWord()
    SEP #$20
    STA $211B
    XBA
    STA $211B
    LDA #$81
    STA $50 ; param
    STX $52
    LDX $65
    INX
    STX $55
    LDA #$01
    STA $211C
    LDA #$20
    STA $420B
    LDX $52
    BRA .main
.case_60
    %ReadByte()
    STX $2181
    STY $52 ; tmp
    LDY $65
-   STA $2180
    INC
    INX
    DEY
    BPL -
    LDY $52
    JMP .main
BANK_INC:
    LDY #$8000
    INC $64
    INC $69
    RTS
BANK_INC_2:
    CPY #$0001
    LDY #$8000
    INC $64
    INC $69
    BCC +
    SEP #$20
    XBA
    LDA [$67],Y
    INY
    XBA
    REP #$21
+   RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
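
The error case in the note at the top of that code can be reproduced with a short Python sketch. The two helper names (`copy_bytewise`, `copy_via_staging`) are made up for illustration: the first stands in for the default routine or MVN, the second for a two-step transfer through a staging buffer such as SRAM. The 9-byte buffer and the values are the ones from the comment.

```python
def copy_bytewise(buf, src, dst, count):
    """Copy one byte at a time, so bytes written earlier can be re-read (like MVN)."""
    for i in range(count):
        buf[dst + i] = buf[src + i]


def copy_via_staging(buf, src, dst, count):
    """Snapshot the source range first, then write it back (like WRAM->SRAM->WRAM)."""
    staging = bytes(buf[src:src + count])
    buf[dst:dst + count] = staging


def fresh_buffer():
    """Destination after <02:01 02 03>: three bytes written, the rest still empty."""
    buf = bytearray(9)
    buf[0:3] = b"\x01\x02\x03"
    return buf


# <85:00 00>: repeat 6 bytes from offset 0, written at offset 3 (source and dest overlap).
expected = fresh_buffer()
copy_bytewise(expected, 0, 3, 6)
staged = fresh_buffer()
copy_via_staging(staged, 0, 3, 6)
print(expected.hex(" "))
print(staged.hex(" "))
```

The byte-wise copy extends the 01 02 03 pattern across the whole range, while the staged copy only has the three valid bytes to work with, which is why the repeat command can't be turned into a plain two-step DMA when the ranges overlap.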
Last edited on 2010-05-27 06:54:55 AM by Min.
for some reason the DMA version is screwing up the new (yet-to-be-released) Layer3ExGFX that reloads GFX on submap change (including FG slots) w/o fblank. It causes layer 1 and sprites to "flicker" above the windowing effects. I think it might have something to do with the DMA transfer that uses channel 6, because when I comment out the STA $420B it stops flickering (of course messes up the GFX load too though), but when I change the channel it doesn't help at all.

I'm looking into a solution right now, because I really think this should be addressed.


EDIT: I just tested it in SNES9x, and the problem isn't exactly the same but it is there. the whole screen flickers black as if entering fblank every other frame.
I guess the routine has a problem running outside of blank, which is a problem for any "OW ExGraFix" patches that reload GFX on submap change without fblank

EDIT2: BSNES does the same

EDIT3: I'm not even sure if the DMA version is faster. And if it is slightly faster, I think we should continue to use the non-DMA version because compatibility > tiny speed increases.
Last edited on 2010-05-27 09:14:50 AM by edit1754.
@edit: It writes to CH5/CH6 registers, any possibility CH5 is messing things up for whatever you are doing? The registers themselves are readable so maybe push them before entering the routine and see if it helps.