| Optimize the LC_LZ2 decompression! |
| Posted on 2010-05-24 05:28:31 AM |
Lately I've been attempting to optimize SMW's LC_LZ2 decompression routine. It went perfectly fine until I hit my limit.
Why am I optimizing this? To decrease the level loading time, even if only by a split second. My code currently looks like this:
Code
HEADER
LOROM
!Freespace = $1D8000
ORG $00B8E3
JML Decomp_start
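macro ReadByte()
LDA [$8A] ; read the next compressed byte through the 24-bit pointer at $8A-$8C
LDX $8A ; load the 16-bit address part of the pointer
INX ; advance to the next byte
BNE + ; no wrap past $FFFF: still in the same bank
LDX.w #$8000 ; wrapped: LoROM data continues at $8000...
INC $8C ; ...of the next bank
+ STX $8A ; write the updated address back
endmacro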
ORG !Freespace
Decomp:
.return JML $00B8EA
.start %ReadByte()
CMP.b #$FF
BEQ .return
STA $8F
AND.b #$E0
CMP.b #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .label2
+ LDA $8F
ASL
ASL
ASL
AND.b #$E0
PHA
LDA $8F
AND.b #$03
XBA
%ReadByte()
REP #$20
.label2 INC A
STA $8D
SEP #$20
PLA
BEQ .label3
BPL .nextup
.label4 %ReadByte()
XBA
%ReadByte()
TAX
- PHY
TXY
LDA [$00],Y
PLY
STA [$00],Y
INY
INX
REP #$20
DEC $8D
SEP #$20
BNE -
JMP.w .start
.nextup ASL
BPL .label5
ASL
BPL .label6
%ReadByte()
LDX $8D
- STA [$00],Y
INC A
INY
DEX
BNE -
JMP.w .start
.label3 %ReadByte()
STA [$00],Y
INY
LDX $8D
DEX
STX $8D
BNE .label3
JMP .start
.label5 %ReadByte()
LDX $8D
- STA [$00],Y
INY
DEX
BNE -
JMP .start
.label6 %ReadByte()
XBA
%ReadByte()
LDX $8D
- XBA
STA [$00],Y
INY
DEX
BEQ +
XBA
STA [$00],Y
INY
DEX
BNE -
+ JMP .start
I'm pretty sure that this can be optimized even more, but seeing that I've hit my limit, I have no idea how. I'd prefer advanced ASM hackers to contribute to this optimization, but other people can try too.
By optimizing I mean faster code, even if it costs ROM space. This means that the code should use as few cycles as possible. It's all for the sake of decreasing level loading times and whatnot. It would be awesome if we actually saw visibly faster loading times.
You can find a list of cycles in this document.
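For readers following the assembly, here is a minimal Python sketch of the command stream that the routine above decodes. It is a reference model read off the posted code, not part of the patch: the function name `lc_lz2_decompress` is made up for illustration, and command values the routine sends to the repeat branch (4 through 7) are treated as repeat here as well.

```python
def lc_lz2_decompress(data: bytes) -> bytes:
    """Reference model of the LC_LZ2 stream decoded by the routine above."""
    out = bytearray()
    pos = 0
    while True:
        header = data[pos]
        pos += 1
        if header == 0xFF:                # end marker (CMP #$FF / BEQ .return)
            return bytes(out)
        command = header >> 5             # top three bits (AND #$E0)
        length = header & 0x1F            # low five bits (AND #$001F)
        if command == 7:                  # long form: 111CCCLL LLLLLLLL
            command = (header >> 2) & 0x07
            length = ((header & 0x03) << 8) | data[pos]
            pos += 1
        length += 1                       # stored value is count - 1 (INC A)

        if command == 0:                  # direct copy (.label3)
            out += data[pos:pos + length]
            pos += length
        elif command == 1:                # byte fill (.label5)
            out += bytes([data[pos]]) * length
            pos += 1
        elif command == 2:                # word fill, alternating two bytes (.label6)
            pair = data[pos:pos + 2]
            pos += 2
            out += bytes(pair[i & 1] for i in range(length))
        elif command == 3:                # increasing fill (INC A in the loop)
            start = data[pos]
            pos += 1
            out += bytes((start + i) & 0xFF for i in range(length))
        else:                             # repeat from earlier output (.label4)
            src = (data[pos] << 8) | data[pos + 1]   # offset is high byte first
            pos += 2
            for i in range(length):       # byte by byte, so overlap is allowed
                out.append(out[src + i])
```

A model like this is handy for checking any rewritten routine against known input, since every variant in the thread has to produce the same output bytes.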
| Last edited on 2010-05-24 05:34:45 AM by Ersanio. |
| Posted on 2010-05-24 10:51:57 AM |
I'm sorry. I can't speak English.
So I'm posting only the code.
Really I'm sorry.
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
macro ReadByte()
LDA [$8A]
LDX $8A
INX
BNE +
LDX.w #$8000
INC $8C
+ STX $8A
endmacro
org !Freespace
Decomp:
.Return PLB
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
.Loop %ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
BEQ .Label3
BPL .NextUp
.Label4 %ReadByte()
XBA
%ReadByte()
TAX
REP #$20
LSR $8D
LDA $8D
BEQ .LoopEnd0
- PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
JMP .Loop
+ PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY
JMP .Loop
.NextUp ASL A
BPL .Label5
ASL A
BPL .Label6
%ReadByte()
LDX $8D
- STA ($00),y
INC A
INY
DEX
BNE -
JMP .Loop
.Label3 %ReadByte()
STA ($00),y
INY
LDX $8D
DEX
STX $8D
BNE .Label3
JMP .Loop
.Label5 %ReadByte()
LDX $8D
- STA ($00),y
INY
DEX
BNE -
JMP .Loop
.Label6 %ReadByte()
XBA
%ReadByte()
XBA
REP #$20
LSR $8D
LDX $8D
BEQ .LoopEnd
- STA ($00),y
INY #2
DEX
BNE -
.LoopEnd SEP #$20
BCS +
JMP .Loop
+ STA ($00),y
INY
JMP .Loop
| Last edited on 2010-05-24 11:24:40 AM by 33953YoShI. |
| Posted on 2010-05-24 11:36:01 AM |
I just added in a little piece of code that stores the Y value (which contains the size of the decompressed ExGFX file) to $8D/$8E before the high byte gets destroyed by the SEP #$10 at $00B8EA.
This is useful for patches that need to upload arbitrary-sized ExGFX files without having to include a full copy of this routine in their code just to get the decompressed size.
$8D/$8E is already overwritten many times in the decompression routine and doesn't contain any useful information afterwards, so it's a good address to use.
EDIT: I'm also still leaning towards coding a GZIP decompression routine for SMW and implementing it into LM - not necessarily for faster decompression speeds, but because GZIP can compress to decently smaller sizes than LC_LZ2 in some cases.
GZIP is a pretty standardized compression format, and I've already found quite a few documents on it; I just haven't gotten around to coding a 65c816 version of the decompressor.
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
macro ReadByte()
LDA [$8A]
LDX $8A
INX
BNE +
LDX.w #$8000
INC $8C
+ STX $8A
endmacro
org !Freespace
Decomp:
.Return PLB
STY $8D ; store size to $8D
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
.Loop %ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
BEQ .Label3
BPL .NextUp
.Label4 %ReadByte()
XBA
%ReadByte()
TAX
REP #$20
LSR $8D
LDA $8D
BEQ .LoopEnd0
- PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
JMP .Loop
+ PHY
TXY
LDA ($00),y
PLY
STA ($00),y
INY
JMP .Loop
.NextUp ASL A
BPL .Label5
ASL A
BPL .Label6
%ReadByte()
LDX $8D
- STA ($00),y
INC A
INY
DEX
BNE -
JMP .Loop
.Label3 %ReadByte()
STA ($00),y
INY
LDX $8D
DEX
STX $8D
BNE .Label3
JMP .Loop
.Label5 %ReadByte()
LDX $8D
- STA ($00),y
INY
DEX
BNE -
JMP .Loop
.Label6 %ReadByte()
XBA
%ReadByte()
XBA
REP #$20
LSR $8D
LDX $8D
BEQ .LoopEnd
- STA ($00),y
INY #2
DEX
BNE -
.LoopEnd SEP #$20
BCS +
JMP .Loop
+ STA ($00),y
INY
JMP .Loop
| Last edited on 2010-05-24 11:46:39 AM by edit1754. |
| Posted on 2010-05-24 12:07:13 PM |
Got rid of the indirect addressing because this is faster, and you can use X, so there's no expensive shuffling of values in and out of Y. Your size thing should be preserved too, edit:
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
macro ReadByte()
LDA [$8A]
LDX $8A
INX
BNE +
LDX.w #$8000
INC $8C
+ STX $8A
endmacro
org !Freespace
Decomp:
.Return PLB
REP #$20
TYA
SEC
SBC $00 ;sub starting pointer
STA $8D ; store size to $8D
SEP #$20
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
LDY $00 ;16bit pointer in Y
.Loop %ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
BEQ .Label3
BPL .NextUp
.Label4 %ReadByte()
XBA
%ReadByte()
REP #$21
ADC $00 ;X needs to be offset by original pointer
TAX
LSR $8D
LDA $8D
BEQ .LoopEnd0
- LDA $0000,x
STA $0000,y
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
JMP .Loop
+ LDA $0000,x
STA $0000,y
INY
JMP .Loop
.NextUp ASL A
BPL .Label5
ASL A
BPL .Label6
%ReadByte()
LDX $8D
- STA $0000,y
INC A
INY
DEX
BNE -
JMP .Loop
.Label3 %ReadByte()
STA $0000,y
INY
LDX $8D
DEX
STX $8D
BNE .Label3
JMP .Loop
.Label5 %ReadByte()
LDX $8D
- STA $0000,y
INY
DEX
BNE -
JMP .Loop
.Label6 %ReadByte()
XBA
%ReadByte()
XBA
REP #$20
LSR $8D
LDX $8D
BEQ .LoopEnd
- STA $0000,y
INY #2
DEX
BNE -
.LoopEnd SEP #$20
BCS +
JMP .Loop
+ STA $0000,y
INY
JMP .Loop
Will contribute more when I'm not so tired. Pardon any stupid mistakes, but they should be easy to spot and correct.
| Last edited on 2010-05-24 12:08:02 PM by smkdan. |
| Posted on 2010-05-25 01:12:42 PM |
I optimized it a little.
If there is a bug, I am sorry.
Code
HEADER
LOROM
!Freespace = $1D8000
org $80B8E3
JML Decomp_Start
macro ReadByte()
LDA [$00],y
INY
BMI +
LDY.w #$8000
INC $02
+
endmacro
macro ReadWord()
LDA [$00],y
INY #2
BMI +
PHA
TYA
ORA #$8000
TAY
SEP #$20
INC $02
REP #$20
PLA
+
endmacro
org !Freespace
Decomp:
.Return PLY
STY $00
LDA $02
STA $8C
STA $8F
PHB
PLA
STA $02
PLB
REP #$20
TXA
SEC
SBC $00 ;sub starting pointer
TXY
STA $8D ; store size to $8D
SEP #$20
JML $80B8EA
.Start PHB
LDA $02
PHA
PLB
LDX $00 ;16bit pointer in X
PHX
STZ $00
STZ $01
LDY $8A
LDA $8C
STA $02
.Loop LDA $7F8182
%ReadByte()
CMP #$FF
BEQ .Return
STA $8F
AND #$E0
CMP #$E0
BEQ +
PHA
LDA $8F
REP #$20
AND.w #$001F
BRA .Label2
+ LDA $8F
ASL #3
AND #$E0
PHA
LDA $8F
AND #$03
XBA
%ReadByte()
REP #$20
.Label2 INC A
STA $8D
SEP #$20
PLA
BEQ .Label3
BPL .NextUp
.Label4 %ReadByte()
XBA
%ReadByte()
PHY
REP #$21
ADC $03,s ;Y needs to be offset by original pointer
TAY
LSR $8D
LDA $8D
BEQ .LoopEnd0
- LDA $0000,y
STA $0000,x
INY #2
INX #2
DEC $8D
BNE -
.LoopEnd0 SEP #$20
BCS +
PLY
JMP .Loop
+ LDA $0000,y
STA $0000,x
INX
PLY
JMP .Loop
.NextUp ASL A
BPL .Label5
ASL A
BPL .Label6
%ReadByte()
PHY
LDY $8D
- STA $0000,x
INC A
INX
DEY
BNE -
PLY
JMP .Loop
.Label3 REP #$20
LSR $8D
LDA $8D
BEQ .LoopEnd1
- %ReadWord()
STA $0000,x
INX #2
DEC $8D
BNE -
.LoopEnd1 SEP #$20
BCS +
JMP .Loop
+ %ReadByte()
STA $0000,x
INX
JMP .Loop
.Label5 %ReadByte()
PHY
LDY $8D
- STA $0000,x
INX
DEY
BNE -
PLY
JMP .Loop
.Label6 REP #$20
%ReadWord()
LSR $8D
PHY
LDY $8D
BEQ .LoopEnd
- STA $0000,x
INX #2
DEY
BNE -
.LoopEnd PLY
SEP #$20
BCS +
JMP .Loop
+ STA $0000,x
INX
JMP .Loop
| Posted on 2010-05-25 01:41:30 PM |
Nice work guys! I've done a small, extremely inaccurate calculation, and it seems like the level loading time improved by 0.5 seconds.
I'm sure this can be optimized even more though, so I'll probably find a way again.
| Posted on 2010-05-25 06:27:40 PM |
Half a second is good. I may use this for the SMAS compression routine, seeing as I used the original SMB1 code anyway. =P
| Posted on 2010-05-25 09:15:51 PM |
Over a million cycles saved is good, and it hasn't even been unrolled yet. There's still room for improvement; the thread is still pretty new.
Just a small bug I noticed (but only happens on bank crossing):
Code
macro ReadWord()
LDA [$00],y
INY #2
BMI +
PHA
TYA
ORA #$8000
TAY
SEP #$20
INC $02
REP #$20
PLA
+
endmacro
If [$00] points to $12:FFFF, for example, it will read the upper byte from $13:0000 instead of $13:8000.
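To make the bank-crossing case concrete, here is a small Python sketch of where the two bytes of a word should come from under LoROM versus where a plain 16-bit read takes them. The function names are hypothetical; the only assumptions are the ones stated in the thread, namely that source data in a bank lives at $8000-$FFFF and continues at $8000 of the next bank.

```python
def next_lorom(bank, addr):
    """Advance a LoROM source pointer by one byte.

    When the 16-bit address wraps past $FFFF, the data continues at
    $8000 of the following bank.
    """
    addr = (addr + 1) & 0xFFFF
    if addr == 0:
        return (bank + 1) & 0xFF, 0x8000
    return bank, addr


def word_bytes_lorom(bank, addr):
    """Addresses of the low and high byte when the wrap is handled."""
    return (bank, addr), next_lorom(bank, addr)


def word_bytes_linear(bank, addr):
    """Addresses a single 16-bit read uses: high byte at the next linear address."""
    linear = ((bank << 16) | addr) + 1
    return (bank, addr), ((linear >> 16) & 0xFF, linear & 0xFFFF)


low, high = word_bytes_lorom(0x12, 0xFFFF)
print("wanted: %02X:%04X" % high)      # high byte taken from $13:8000
low, high = word_bytes_linear(0x12, 0xFFFF)
print("16-bit read: %02X:%04X" % high)  # high byte taken from $13:0000
```

Away from the last byte of a bank, both functions agree, which is why the bug only shows up on a bank crossing.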
Code
macro ReadWord()
LDA [$00],y
INY #2
BMI +
PHP
LDY #$8000
SEP #$20 ;is $03 used for anything? can save a cycle without the SEP
INC $02
PLP
BEQ +
XBA ;it's the high byte that got affected
SEP #$20
LDA [$00],Y
XBA
INY
REP #$20
+
endmacro
Didn't test that, so point out anything that doesn't look quite right, of course =)
edit: since push/pull is actually slower than STA dp/LDA dp, some savings can be made by keeping those values in DP space that the routine doesn't otherwise use.
| Last edited on 2010-05-25 09:24:22 PM by smkdan. |
| Posted on 2010-05-26 04:39:32 AM |
Going through the current code carefully, I've noticed that the following pieces of code can be unrolled:
Code
- LDA $0000,y ;direct copy?
STA $0000,x
INY #2
INX #2
DEC $8D
BNE -
Code
LDY $8D
- STA $0000,x ;direct fill?
INX
DEY
BNE -
Code
- STA $0000,x ;direct word fill?
INX #2
DEY
BNE -
I'd try to do this myself but seeing that I have school today...............
| Posted on 2010-05-26 07:19:57 AM |
MVN test.
Code
header
lorom
!ofs = $8FF000
macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE $03
JSR BANK_INC
endmacro
org $80B8E3
JSL !ofs
RTS
org !ofs
PHB
PEI ($03)
PEI ($05)
PEI ($07)
PEI ($09)
PEI ($0B)
PEI ($8A)
SEP #$20
REP #$10
LDA $02
PHA
PLB
STA $05 ; dest_bank
INC
STA $03 ; dest_bank [plus or minus]
LDA #$54
STA $04 ; mvn
LDA #$4C
STA $07 ; jump
LDA $8C
STA $06 ; src_bank
LDX.w #.back
STX $08
LDY $00 ; dest_low
LDX $8A ; src_low
STZ $8A
STZ $8B
BRA .main
.case_80_or_e0
BPL .lz
LDA $8D
CMP #$1F
BNE .case_e0
PLX : STX $8A
PLX : STX $0B
PLX : STX $09
PLX : STX $07
PLX : STX $05
PLX : STX $03
SEP #$10
PLB
RTL
.case_e0
AND #$03
STA $8E
EOR $8D
ASL
ASL
ASL
XBA
%ReadByte()
STA $8D
XBA
BRA .type
.lz
%ReadByte()
XBA
%ReadByte()
STX $0B
REP #$21
ADC $00
TAX
LDA $8D
SEP #$20
BIT $03
BPL +
MVN $7F7F
BRA ++
+ MVN $7E7E
++ LDX $0B
.main
%ReadByte()
STA $8D
STZ $8E
AND #$E0
TRB $8D
.type
ASL
BCS .case_80_or_e0
BMI .case_40_or_60
ASL
BMI .case_20
.case_00
REP #$20
LDA $8D
STX $8D
- SEP #$20
JMP $0004
.back
CPX $8D
BCS .main
JSR BANK_INC_2
CPX #$0000
BEQ ++
DEX
STX $0B
REP #$21
LDX #$8000
STX $8D
TYA
SBC $0B
TAY
LDA $0B
BRA -
++ LDX #$8000
BRA .main
.case_20
%ReadByte()
STX $0B
PHA
PHA
REP #$20
.case_20_main
LDA $8D
INC
LSR
TAX
PLA
- STA $0000,Y
INY
INY
DEX
BNE -
SEP #$20
BCC +
STA $0000,Y
INY
+ LDX $0B
BRA .main
.case_40_or_60
ASL
BMI .case_60
%ReadByte()
XBA
%ReadByte()
XBA
STX $0B
REP #$20
PHA
BRA .case_20_main
.case_60
%ReadByte()
STX $0B
LDX $8D
- STA $0000,Y
INC
INY
DEX
BPL -
LDX $0B
JMP .main
BANK_INC:
LDX #$8000
BANK_INC_2:
INC $06
INC $8C
RTS
| Posted on 2010-05-26 08:18:58 AM |
@ersanio: the unrollable loops would be good candidates for conversion to DMA, except word fill, which would be pretty awkward. The copy and byte-fill loops are suitable though.
@Min: Self-modifying code with MVN is better than load/store, but DMA WRAM->SRAM->WRAM is 2 cycles/byte, not counting setup. A few days ago there was a chat about just using DMA (it would also work for byte copy).
@ersanio/others: For byte fill you can just use DMA and set it to not increment the source address, so it reads the same byte every time. That's 1 cycle per byte transferred to $2180. You just have to store the fill byte somewhere in SRAM, because WRAM->WRAM transfers are not allowed.
Copy can do $2180->$70xxxx->$2180, but some SRAM must be reserved. It's much faster, but more awkward to use. MVN is easy, so it depends on what we decide. If we ever have to copy something like 200 bytes (a large monocolored area, a repeating tile sequence, etc.), then DMA will be massively faster than MVN. For small quantities, MVN is OK due to DMA setup time.
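The trade-off between MVN and DMA can be put into rough numbers. The sketch below is only an estimate: MVN's 7 cycles per byte is the documented block-move cost, the 2 cycles per byte for the staged DMA copy is the figure quoted in this post, and the DMA setup cost is a placeholder parameter, since nobody in the thread has measured it.

```python
MVN_CYCLES_PER_BYTE = 7  # 65816 block move: 7 cycles for each byte moved


def mvn_cost(n, setup=0):
    """Estimated cycles to copy n bytes with MVN."""
    return setup + MVN_CYCLES_PER_BYTE * n


def dma_cost(n, setup, per_byte=2):
    """Estimated cycles to copy n bytes with the staged DMA copy.

    per_byte=2 is the figure quoted in the post; setup is an assumed
    value for the register writes, not a measured one.
    """
    return setup + per_byte * n


def break_even_bytes(dma_setup, per_byte=2, mvn_setup=0):
    """Smallest byte count for which DMA is strictly cheaper than MVN."""
    saving = MVN_CYCLES_PER_BYTE - per_byte
    extra = dma_setup - mvn_setup
    if extra <= 0:
        return 1
    return extra // saving + 1


# With a hypothetical 100-cycle DMA setup:
print(break_even_bytes(100))            # copies this long or longer favor DMA
print(mvn_cost(200), dma_cost(200, 100))  # the 200-byte case from the post
```

With a different setup figure the crossover point moves, but the shape of the argument stays the same: the fixed setup cost has to be paid back by the per-byte saving.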
By the end when everyone has put in their contribution it should be much, much faster than the original.
@Japan guys: if there is anyone else to bring please bring them since the communities are kind of divided for whatever reason. Even if the English is not too good it is still very easy to contribute to projects like these.
| Last edited on 2010-05-26 08:19:45 AM by smkdan. |
| Posted on 2010-05-26 12:58:13 PM |
Re-added the size to $8D thing. I had to move one block of code to prevent it from causing out-of-range branches.
It will be very helpful if this stays in there.
Also added a RATS tag and an insert-size counter (current size is 351 bytes).
Code
header
lorom
!ofs = $8FF000
macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE $03
JSR BANK_INC
endmacro
org $80B8E3
JSL CodeStart
RTS
org !ofs
reset bytes
db "STAR"
dw CodeEnd-CodeStart-$01
dw CodeEnd-CodeStart-$01^$FFFF
CodeStart:
PHB
PEI ($03)
PEI ($05)
PEI ($07)
PEI ($09)
PEI ($0B)
PEI ($8A)
SEP #$20
REP #$10
LDA $02
PHA
PLB
STA $05 ; dest_bank
INC
STA $03 ; dest_bank [plus or minus]
LDA #$54
STA $04 ; mvn
LDA #$4C
STA $07 ; jump
LDA $8C
STA $06 ; src_bank
LDX.w #.back
STX $08
LDY $00 ; dest_low
LDX $8A ; src_low
STZ $8A
STZ $8B
BRA .main
.case_e0
AND #$03
STA $8E
EOR $8D
ASL
ASL
ASL
XBA
%ReadByte()
STA $8D
XBA
BRA .type
.case_80_or_e0
BPL .lz
LDA $8D
CMP #$1F
BNE .case_e0
PLX : STX $8A
PLX : STX $0B
PLX : STX $09
PLX : STX $07
PLX : STX $05
PLX : STX $03
REP #$20
TYA
SEC
SBC $00
STA $8D ; size!!!
SEP #$30
PLB
RTL
.lz
%ReadByte()
XBA
%ReadByte()
STX $0B
REP #$21
ADC $00
TAX
LDA $8D
SEP #$20
BIT $03
BPL +
MVN $7F7F
BRA ++
+ MVN $7E7E
++ LDX $0B
.main
%ReadByte()
STA $8D
STZ $8E
AND #$E0
TRB $8D
.type
ASL
BCS .case_80_or_e0
BMI .case_40_or_60
ASL
BMI .case_20
.case_00
REP #$20
LDA $8D
STX $8D
- SEP #$20
JMP $0004
.back
CPX $8D
BCS .main
JSR BANK_INC_2
CPX #$0000
BEQ ++
DEX
STX $0B
REP #$21
LDX #$8000
STX $8D
TYA
SBC $0B
TAY
LDA $0B
BRA -
++ LDX #$8000
BRA .main
.case_20
%ReadByte()
STX $0B
PHA
PHA
REP #$20
.case_20_main
LDA $8D
INC
LSR
TAX
PLA
- STA $0000,Y
INY
INY
DEX
BNE -
SEP #$20
BCC +
STA $0000,Y
INY
+ LDX $0B
BRA .main
.case_40_or_60
ASL
BMI .case_60
%ReadByte()
XBA
%ReadByte()
XBA
STX $0B
REP #$20
PHA
BRA .case_20_main
.case_60
%ReadByte()
STX $0B
LDX $8D
- STA $0000,Y
INC
INY
DEX
BPL -
LDX $0B
JMP .main
BANK_INC:
LDX #$8000
BANK_INC_2:
INC $06
INC $8C
RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
| Last edited on 2010-05-26 12:59:44 PM by edit1754. |
| Posted on 2010-05-26 02:25:23 PM |
Eliminated unnecessary JSRs. The code is about 0.7 seconds faster now o_o
Code
HEADER
LOROM
!Freespace = $1D8000|$800000
macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE +
LDX #$8000
INC $06
INC $8C
+
endmacro
org $80B8E3
JSL CodeStart ;was JML before
RTS
org !Freespace
reset bytes
db "STAR"
dw CodeEnd-CodeStart-$01
dw CodeEnd-CodeStart-$01^$FFFF
CodeStart:
PHB
PEI ($03)
PEI ($05)
PEI ($07)
PEI ($09)
PEI ($0B)
PEI ($8A)
SEP #$20
REP #$10
LDA $02
PHA
PLB
STA $05 ; dest_bank
INC
STA $03 ; dest_bank [plus or minus]
LDA #$54
STA $04 ; mvn
LDA #$4C
STA $07 ; jump
LDA $8C
STA $06 ; src_bank
LDX.w #.back
STX $08
LDY $00 ; dest_low
LDX $8A ; src_low
STZ $8A
STZ $8B
BRA .main
.case_e0
AND #$03
STA $8E
EOR $8D
ASL
ASL
ASL
XBA
%ReadByte()
STA $8D
XBA
BRA .type
.case_80_or_e0
BPL .lz
LDA $8D
CMP #$1F
BNE .case_e0
PLX : STX $8A
PLX : STX $0B
PLX : STX $09
PLX : STX $07
PLX : STX $05
PLX : STX $03
REP #$20
TYA
SEC
SBC $00
STA $8D ; size!!!
SEP #$30
PLB
RTL ;JML $80B8EA
.lz
%ReadByte()
XBA
%ReadByte()
STX $0B
REP #$21
ADC $00
TAX
LDA $8D
SEP #$20
BIT $03
BPL +
MVN $7F7F
BRA ++
+ MVN $7E7E
++ LDX $0B
.main
%ReadByte()
STA $8D
STZ $8E
AND #$E0
TRB $8D
.type
ASL
BCS .case_80_or_e0
BMI .case_40_or_60
ASL
BMI .case_20
.case_00
REP #$20
LDA $8D
STX $8D
- SEP #$20
JMP $0004
.back
CPX $8D
BCS .main
INC $06
INC $8C
CPX #$0000
BEQ ++
DEX
STX $0B
REP #$21
LDX #$8000
STX $8D
TYA
SBC $0B
TAY
LDA $0B
BRA -
++ LDX #$8000
BRA .main
.case_20
%ReadByte()
STX $0B
PHA
PHA
REP #$20
.case_20_main
LDA $8D
INC
LSR
TAX
PLA
- STA $0000,Y
INY
INY
DEX
BNE -
SEP #$20
BCC +
STA $0000,Y
INY
+ LDX $0B
BRA .main
.case_40_or_60
ASL
BMI .case_60
%ReadByte()
XBA
%ReadByte()
XBA
STX $0B
REP #$20
PHA
BRA .case_20_main
.case_60
%ReadByte()
STX $0B
LDX $8D
- STA $0000,Y
INC
INY
DEX
BPL -
LDX $0B
JMP .main
CodeEnd:
print "Insert Size: ",bytes," bytes"
| Last edited on 2010-05-27 03:16:13 AM by Ersanio. |
| Posted on 2010-05-26 10:18:46 PM |
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond, give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL. I hate to keep going back to SMAS, but I do intend on using this routine in more than one spot. Point is, it can happen in SMW too, so I'd say the convenience of keeping an RTL outweighs the additional microsecond we'd save.
Just my opinion. Anyone can rebut.
| Posted on 2010-05-26 10:27:47 PM |
I would just have the JML as default and add comments for JSL conversion of the hijack.
| Posted on 2010-05-26 11:09:12 PM |
Originally posted by spel werdz rite
Seeing as the JSL/RTL only gets called about 20 times max per level, you only lose about 1 microsecond give or take. Also, because there is potential for a hacker to need the decompression routine for whatever reason, I argue that it should stay as an RTL.
Agreeing with this. It's literally impossible to notice the difference and just causes inconvenience, so there's little reason to keep it. The focus should be on optimizing the loop content.
| Posted on 2010-05-27 03:14:50 AM |
@smkdan and SWR: you both have good points actually.
I'll change them to JSL/RTL again.
| Posted on 2010-05-27 05:21:03 AM |
@smkdan: Thanks.
I rewrote the code.
Code
lorom
header
!Freespace = $1D8000|$800000
macro ReadByte()
STX $8A
LDA [$8A]
INX
BNE +
JSR BANK_INC
+
endmacro
macro ReadWord()
STX $8A
LDA [$8A]
INX
INX
BMI +
JSR BANK_INC_2
+
endmacro
org $80B8E3
JSL CodeStart
RTS
org !Freespace
reset bytes
db "STAR"
dw CodeEnd-CodeStart-$0001
dw CodeEnd-CodeStart-$0001^$FFFF
CodeStart:
PHB
PEI ($03)
PEI ($05)
PEI ($07)
PEI ($09)
PEI ($0B)
PEI ($8A)
SEP #$20
REP #$10
LDA $02
PHA
PLB
STA $05 ; dest_bank
INC
STA $03 ; dest_bank [plus or minus]
LDA #$54
STA $04 ; mvn
LDA #$4C
STA $07 ; jump
LDA $8C
STA $06 ; src_bank
LDX.w #.back
STX $08
LDY $00 ; dest_low
LDX $8A ; src_low
STZ $8A
STZ $8B
JMP .main
.case_ff
PLX : STX $8A
PLX : STX $0B
PLX : STX $09
PLX : STX $07
PLX : STX $05
PLX : STX $03
REP #$20
TYA
SBC $00 ; carry = 1
STA $8D ; size
SEP #$30
PLB
RTL
.case_e0
LDA $8D
CMP #$1F
BEQ .case_ff
AND #$03
STA $8E
EOR $8D
ASL
ASL
ASL
XBA
%ReadByte()
STA $8D
XBA
BRA .type
.case_00
LDA $8E
XBA
LDA $8D
- JMP $0004
.back
CPX #$0000
BMI .main
INC $06
INC $8C
DEX
BMI ++
STX $0B
REP #$21
LDX #$8000
TYA
SBC $0B
TAY
LDA $0B
SEP #$20
BRA -
++ LDX #$8000
BRA .main
.case_80_or_e0
BMI .case_e0
REP #$21
%ReadWord()
XBA
STX $8A
ADC $00
TAX
LDA $8D
SEP #$20
BIT $03
BPL +
MVN $7F7F
BRA ++
+ MVN $7E7E
++ LDX $8A
BRA .bra
.main
STX $8A
.bra
LDA [$8A]
INX
BNE +
JSR BANK_INC
+ STA $8D
STZ $8E
AND #$E0
TRB $8D
.type
ASL
BCS .case_80_or_e0
BEQ .case_00
BMI .case_40_or_60
.case_20
%ReadByte()
STX $0B
PHA
PHA
REP #$20
.case_20_main
LDA $8D
INC
LSR
TAX
PLA
- STA $0000,Y
INY
INY
DEX
BNE -
SEP #$20
LDX $0B
BCC .main
STA $0000,Y
INY
BRA .main
.case_40_or_60
ASL
BMI .case_60
REP #$20
%ReadWord()
STX $0B
PHA
BRA .case_20_main
.case_60
%ReadByte()
STX $0B
LDX $8D
- STA $0000,Y
INC
INY
DEX
BPL -
LDX $0B
JMP .main
BANK_INC:
LDX #$8000
INC $06
INC $8C
RTS
BANK_INC_2:
CPX #$0001
LDX #$8000
INC $06 ; $07($8D) is not affected.
INC $8C
BCC +
SEP #$20
XBA
STX $8A
LDA [$8A]
XBA
INX
REP #$21
+ RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
DMA version (does not use SRAM)
// updated
Code
lorom
header
!Freespace = $1D8000|$800000
; Note:
; - This routine uses $00:211B-$00:211C,
; $00:2134-$00:2136, $00:435x-$00:436x.
;
; - This routine doesn't use WRAM->SRAM->WRAM DMA,
; because it will fail in some cases.
;
; Error case:
; CompData: <02:01 02 03> <85:00 00> <FF>
; dest: $7F0000
;
; <02:01 02 03>
; 01 02 03 .. .. .. .. .. ..
; 01 02 03 .. .. .. .. .. ..
;
; <85:00 00>
; 01 02 03 01 02 03 01 02 03 ; Default / MVN
; 01 02 03 01 02 03 .. .. .. ; (for example)
; WRAM->SRAM ($7F:0000-$7F:0005 -> $70:0000-$70:0005)
; SRAM->WRAM ($70:0000-$70:0005 -> $7F:0003-$7F:0008)
;
; - It is needed to change the value of D register
; from $4300 to $0000 at the beginning of NMI/IRQ.
;
macro ReadByte()
LDA [$67],Y
INY
BNE +
JSR BANK_INC
+
endmacro
macro ReadWord()
LDA [$67],Y
INY
INY
BMI +
JSR BANK_INC_2
+
endmacro
org $80B8E3
JSL CodeStart
RTS
org !Freespace
reset bytes
db "STAR"
dw CodeEnd-CodeStart-$0001
dw CodeEnd-CodeStart-$0001^$FFFF
CodeStart:
PHB
PHK
PLB
PHD
PEA $4300
PLD
LDX #$3480
STX $50 ; dma_param
LDX $0000
STX $58 ; dest_low[start] / (HDMA Table Address)
LDA $0002
STA $54 ; dest_bank
STA $2183
INC
STA $57 ; [Plus or Minus] / (HDMA Indirect Bank)
LDY #$8000
STY $60 ; dma_param
LDY $008A
STZ $67 ; src_low
STZ $68 ; src_high
LDA $008C
STA $64 ; src_bank
STA $69 ; src_bank
JMP .main
.end
PLD
REP #$20
TXA
SBC $00 ; carry = 1
STA $8D ; size
SEP #$30
PLB
RTL
.case_e0
LDA $65
CMP #$1F
BEQ .end
AND #$03
STA $66
EOR $65
ASL #3
XBA
%ReadByte()
STA $65
XBA
BRA .type
.case_00
REP #$21
INC $65 ; bytecount
STX $2181
TXA
ADC $65
TAX
STY $62
- SEP #$20
LDA #$40
STA $420B
LDY $62
BMI .main
INC $64
INC $69
CPY #$0000
BEQ ++
REP #$20
STY $65
LDY #$8000
STY $62
TXA
SBC $65 ; carry = 1
STA $2181
BRA -
++ LDY #$8000
BRA .main
.case_80_or_e0
BMI .case_e0
REP #$21
%ReadWord()
STY $52 ; tmp
TXY
XBA
ADC $58
TAX
LDA $65
SEP #$20
PHB
BIT $57
BPL +
MVN $7F7F
BRA ++
+ MVN $7E7E
++ TYX
LDY $52
PLB
.main
%ReadByte()
STA $65
STZ $66
AND #$E0
TRB $65
.type
ASL
BCS .case_80_or_e0
BEQ .case_00
BMI .case_40_or_60
.case_20
%ReadByte()
STA $211B
STZ $211B
LDA #$80
STA $50 ; param
STX $52
LDX $65
INX
STX $55
LDA #$01
STA $211C
LDA #$20
STA $420B
LDX $52
BRA .main
.case_40_or_60
ASL
BMI .case_60
.case_40
REP #$20
%ReadWord()
SEP #$20
STA $211B
XBA
STA $211B
LDA #$81
STA $50 ; param
STX $52
LDX $65
INX
STX $55
LDA #$01
STA $211C
LDA #$20
STA $420B
LDX $52
BRA .main
.case_60
%ReadByte()
STX $2181
STY $52 ; tmp
LDY $65
- STA $2180
INC
INX
DEY
BPL -
LDY $52
JMP .main
BANK_INC:
LDY #$8000
INC $64
INC $69
RTS
BANK_INC_2:
CPY #$0001
LDY #$8000
INC $64
INC $69
BCC +
SEP #$20
XBA
LDA [$67],Y
INY
XBA
REP #$21
+ RTS
CodeEnd:
print "Insert Size: ",bytes," bytes"
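The error case described in the notes at the top of the DMA listing can be reproduced with a short Python sketch. It contrasts a byte-by-byte forward copy, which is how the original loop and MVN behave, with a copy that first snapshots the source range into a staging buffer, which is what a WRAM->SRAM->WRAM transfer amounts to. The function names are hypothetical, and bytes that have not been written yet are modeled as zero, whereas real WRAM would hold whatever was there before.

```python
def repeat_forward(out, src, count):
    """Copy count bytes from out[src:] to the end of out, one byte at a time.

    Because each byte is read after the previous one was written, the
    source range may overlap the bytes being produced.
    """
    for i in range(count):
        out.append(out[src + i])


def repeat_staged(out, src, count):
    """Copy through a staging buffer: snapshot the source first, then append.

    Bytes past the current end of the output do not exist yet; they are
    modeled here as zero (an assumption for illustration).
    """
    snapshot = bytes(out[src:src + count]).ljust(count, b"\x00")
    out += snapshot


# <02:01 02 03> has produced three bytes; <85:00 00> repeats 6 bytes from offset 0.
good = bytearray([0x01, 0x02, 0x03])
repeat_forward(good, 0, 6)
bad = bytearray([0x01, 0x02, 0x03])
repeat_staged(bad, 0, 6)
print(good.hex(" "))  # pattern continues through all six copied bytes
print(bad.hex(" "))   # only the first three copied bytes are valid
```

When the source range lies entirely before the write position, both functions give the same result, so a staged copy would only be safe if the routine checked for overlap first.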
| Last edited on 2010-05-27 06:54:55 AM by Min. |
| Posted on 2010-05-27 08:07:03 AM |
For some reason the DMA version is screwing up the new (yet-to-be-released) Layer3ExGFX that reloads GFX on submap change (including FG slots) without fblank. It causes layer 1 and sprites to "flicker" above the windowing effects. I think it might have something to do with the DMA transfer that uses channel 6, because when I comment out the STA $420B it stops flickering (though of course that messes up the GFX load too), but changing the channel doesn't help at all.
I'm looking into a solution right now, because I really think this should be addressed.
EDIT: I just tested it in SNES9x, and the problem isn't exactly the same, but it is there. The whole screen flickers black as if entering fblank every other frame.
I guess the routine has a problem running outside of blank, which is a problem for any "OW ExGraFix" patches that reload GFX on submap change without fblank.
EDIT2: BSNES does the same.
EDIT3: I'm not even sure if the DMA version is faster. And if it is slightly faster, I think we should continue to use the non-DMA version because compatibility > tiny speed increases.
| Last edited on 2010-05-27 09:14:50 AM by edit1754. |
| Posted on 2010-05-27 09:14:56 AM |
@edit: It writes to CH5/CH6 registers, any possibility CH5 is messing things up for whatever you are doing? The registers themselves are readable so maybe push them before entering the routine and see if it helps.
The purpose of this site is not to distribute copyrighted material, but to honor one of our favourite games.
Copyright © 2005 - 2013 - SMW Central