Posts by Drex


Originally posted by Final Theory
If I had the skillz I would make a game that utilizes tons of custom chips, but I just don't have the programming skills required for it.

Also, with the sd2snes, that cartridge contains lots of custom chips, and if you could utilize all of those chips in a single game, you could probably make one of the most fascinating games ever made, since I don't think any of the original SNES games used all of those custom chips in a single game.

Again, if I were a super savant programmer I would be all into these custom chips.

Also, I think the Super FX is really the one we should all be working on getting to work in our hacks. It's like 20 MHz, right? So as far as I know, it's the single most powerful chip that ever existed on the actual SNES.


You could still do a lot of cool things on stock hardware just with CPU tricks. I want to see more people try to do stuff like this on the SNES.

https://www.youtube.com/watch?v=Pe_NSqiu2X4

http://bin.smwcentral.net/u/28835/Alisha%2527s%2BAdventure%2BSource.zip
Originally posted by zack30
Originally posted by Ladida
that screenshot is indeed my creation. however, it is not a mode 5 test; it's a mode 5 status bar test (the rest of the level is still mode 1)

the main problem with hi-res tilemaps is that they require double the VRAM of a normal tilemap. quadruple if you enable interlace to get double the vertical res. SMW's VRAM just isn't that flexible, especially with smkd's VRAM patch


there is an actual mode 5 level test made by Roy somewhere, but it probably got lost in the flow of time

Originally posted by zack30
In the case that you'd have to worry about slowdown from the increased resolution, you could always resort to using SA1.

the slowdown would result from the increased amount of data that needs to be DMA'd to VRAM. SA1 cannot solve that


Ah, so it would be a pain to use Hires mode in SMW, the main problems being VRAM related.
That's the general thing I hear about Hires mode: There's just not enough VRAM to run SMW fully in hires mode.
Would it be a problem to only enable Hires mode and not tamper with anything else such as tileset resolution, or would that still be difficult?


VRAM is only an issue if you're using a ton of unique BG tiles, and DMA is only an issue when you're animating a lot of unique BG tiles at once.

IMHO, you guys worry about limitations wayyy too much.
Have you guys actually tried to optimize the code at all? Anybody know where I can find stuff like the sprite drawing routine and collision, so I can see if they can be improved?
I found SMW's sprite drawing code, and it looks like it can be sped up quite a bit. In multiple places it looks like it could really benefit from 16-bit mode, and some of the math looks redundant.




FinishOAMWrite: 8B PHB ; Wrapper
CODE_01B7B4: 4B PHK
CODE_01B7B5: AB PLB
CODE_01B7B6: 20 BB B7 JSR.W FinishOAMWriteRt
CODE_01B7B9: AB PLB
Return01B7BA: 6B RTL ; Return

FinishOAMWriteRt: 84 0B STY $0B
CODE_01B7BD: 85 08 STA $08
CODE_01B7BF: BC EA 15 LDY.W RAM_SprOAMIndex,X ; Y = Index into sprite OAM
CODE_01B7C2: B5 D8 LDA RAM_SpriteYLo,X
CODE_01B7C4: 85 00 STA $00
CODE_01B7C6: 38 SEC
CODE_01B7C7: E5 1C SBC RAM_ScreenBndryYLo
CODE_01B7C9: 85 06 STA $06 ; put Y - Ycamera, in $06
CODE_01B7CB: BD D4 14 LDA.W RAM_SpriteYHi,X
CODE_01B7CE: 85 01 STA $01 ; put Y in $00, could've used 16-bit mode
CODE_01B7D0: B5 E4 LDA RAM_SpriteXLo,X
CODE_01B7D2: 85 02 STA $02
CODE_01B7D4: 38 SEC
CODE_01B7D5: E5 1A SBC RAM_ScreenBndryXLo
CODE_01B7D7: 85 07 STA $07 ; put X - Xcamera, in $07
CODE_01B7D9: BD E0 14 LDA.W RAM_SpriteXHi,X
CODE_01B7DC: 85 03 STA $03 ; put X in $02, could've used 16-bit mode
CODE_01B7DE: 98 TYA
CODE_01B7DF: 4A LSR
CODE_01B7E0: 4A LSR
CODE_01B7E1: AA TAX
CODE_01B7E2: A5 0B LDA $0B
CODE_01B7E4: 10 0A BPL CODE_01B7F0
CODE_01B7E6: BD 60 04 LDA.W OAM_TileSize,X
CODE_01B7E9: 29 02 AND.B #$02
CODE_01B7EB: 9D 60 04 STA.W OAM_TileSize,X
CODE_01B7EE: 80 03 BRA CODE_01B7F3

CODE_01B7F0: 9D 60 04 STA.W OAM_TileSize,X
CODE_01B7F3: A2 00 LDX.B #$00
CODE_01B7F5: B9 00 03 LDA.W OAM_DispX,Y
CODE_01B7F8: 38 SEC
CODE_01B7F9: E5 07 SBC $07
CODE_01B7FB: 10 01 BPL CODE_01B7FE
CODE_01B7FD: CA DEX
CODE_01B7FE: 18 CLC
CODE_01B7FF: 65 02 ADC $02 ; old X - new X + Xcamera + new X = $04
CODE_01B801: 85 04 STA $04
CODE_01B803: 8A TXA
CODE_01B804: 65 03 ADC $03 ; once again, why isn't this using 16-bit mode?
CODE_01B806: 85 05 STA $05
CODE_01B808: 20 44 B8 JSR.W CODE_01B844 ; pointless jump to subroutine
CODE_01B80B: 90 0C BCC CODE_01B819
CODE_01B80D: 98 TYA ; already did this, why didn't they phx/plx?
CODE_01B80E: 4A LSR
CODE_01B80F: 4A LSR
CODE_01B810: AA TAX
CODE_01B811: BD 60 04 LDA.W OAM_TileSize,X
CODE_01B814: 09 01 ORA.B #$01
CODE_01B816: 9D 60 04 STA.W OAM_TileSize,X
CODE_01B819: A2 00 LDX.B #$00
CODE_01B81B: B9 01 03 LDA.W OAM_DispY,Y
CODE_01B81E: 38 SEC
CODE_01B81F: E5 06 SBC $06
CODE_01B821: 10 01 BPL CODE_01B824
CODE_01B823: CA DEX
CODE_01B824: 18 CLC
CODE_01B825: 65 00 ADC $00 ; old Y - new Y + Ycamera + new Y = $09
CODE_01B827: 85 09 STA $09
CODE_01B829: 8A TXA
CODE_01B82A: 65 01 ADC $01
CODE_01B82C: 85 0A STA $0A
CODE_01B82E: 20 BF C9 JSR.W CODE_01C9BF ; another pointless jump
CODE_01B831: 90 05 BCC CODE_01B838
CODE_01B833: A9 F0 LDA.B #$F0
CODE_01B835: 99 01 03 STA.W OAM_DispY,Y
CODE_01B838: C8 INY
CODE_01B839: C8 INY
CODE_01B83A: C8 INY
CODE_01B83B: C8 INY
CODE_01B83C: C6 08 DEC $08
CODE_01B83E: 10 9E BPL CODE_01B7DE
CODE_01B840: AE E9 15 LDX.W $15E9 ; X = Sprite index
Return01B843: 60 RTS ; Return

CODE_01B844: C2 20 REP #$20 ; Accum (16 bit)
CODE_01B846: A5 04 LDA $04
CODE_01B848: 38 SEC
CODE_01B849: E5 1A SBC RAM_ScreenBndryXLo
CODE_01B84B: C9 00 01 CMP.W #$0100
CODE_01B84E: E2 20 SEP #$20 ; Accum (8 bit)
Return01B850: 60 RTS ; Return

CODE_01C9BF: C2 20 REP #$20 ; Accum (16 bit)
CODE_01C9C1: A5 09 LDA $09
CODE_01C9C3: 48 PHA
CODE_01C9C4: 18 CLC
CODE_01C9C5: 69 10 00 ADC.W #$0010
CODE_01C9C8: 85 09 STA $09
CODE_01C9CA: 38 SEC
CODE_01C9CB: E5 1C SBC RAM_ScreenBndryYLo
CODE_01C9CD: C9 00 01 CMP.W #$0100
CODE_01C9D0: 68 PLA
CODE_01C9D1: 85 09 STA $09
CODE_01C9D3: E2 20 SEP #$20 ; Accum (8 bit)
Return01C9D5: 60 RTS ; Return
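
For readers who find the 8-bit/16-bit juggling hard to follow, here is a rough Python model of what the listing above appears to compute for each OAM tile. It is only a sketch based on reading the disassembly, not code from the game, and the function and parameter names (`sign_extend8`, `tile_world_coord`, `tile_x_high_bit`, `tile_y_hidden`, `sprite_x`, `cam_x` and so on) are made up for illustration.

```python
def sign_extend8(value):
    """Treat the low 8 bits of value as a signed byte (the BPL/DEX step)."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def tile_world_coord(oam_coord, sprite_coord, cam_coord):
    """Rebuild a tile's 16-bit level coordinate from its 8-bit OAM coordinate."""
    # $06/$07 in the listing: low byte of (sprite position - camera position)
    sprite_on_screen = (sprite_coord - cam_coord) & 0xFF
    # Signed offset of this tile from the sprite's origin
    offset = sign_extend8(oam_coord - sprite_on_screen)
    return (sprite_coord + offset) & 0xFFFF


def tile_x_high_bit(oam_x, sprite_x, cam_x):
    """True when the routine would set bit 0 of the tile's size-table byte."""
    world_x = tile_world_coord(oam_x, sprite_x, cam_x)
    return ((world_x - cam_x) & 0xFFFF) >= 0x0100   # CMP.W #$0100 sets carry


def tile_y_hidden(oam_y, sprite_y, cam_y):
    """True when the routine would move the tile to Y = $F0."""
    world_y = tile_world_coord(oam_y, sprite_y, cam_y)
    return ((world_y + 0x0010 - cam_y) & 0xFFFF) >= 0x0100


# A tile 16 pixels to the left of a sprite sitting 5 pixels into the screen
print(tile_x_high_bit(0xF5, 0x0105, 0x0100))
```

Seen this way, both tests are a single unsigned 16-bit comparison against $0100, which is why the two small subroutines are natural candidates for inlining.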





Optimized Version:





FinishOAMWrite:
PHB ; Wrapper
PHK
PLB
JSR.W FinishOAMWriteRt
PLB
RTL ; Return

FinishOAMWriteRt:
STY $0B
STA $08
LDY.W RAM_SprOAMIndex,X ; Y = Index into sprite OAM
rep #$20
LDA RAM_SpriteYLo,X
STA $00
SEC
SBC RAM_ScreenBndryYLo
STA $06 ; put Y - Ycamera, in $06

LDA RAM_SpriteXLo,X
STA $02
sep #$21
SBC RAM_ScreenBndryXLo
STA $07 ; put X - Xcamera, in $07


-;
TYA
LSR
LSR
TAX
phx
LDA $0B
BPL +
LDA.W OAM_TileSize,X
AND.B #$02
STA.W OAM_TileSize,X
BRA ++


+;
STA.W OAM_TileSize,X
+;
LDX.B #$00
LDA.W OAM_DispX,Y
SEC
SBC $07
BPL +
DEX
+;
xba
txa
xba
rep #$20
CLC
ADC $02 ; old X - new X + Xcamera + new X = $04
STA $04 ; lots of redundancy here!!!

SEC
SBC RAM_ScreenBndryXLo
CMP.W #$0100
SEP #$20 ; Accum (8 bit)

plx

BCC +
LDA.W OAM_TileSize,X
ORA.B #$01
STA.W OAM_TileSize,X
+;
LDX.B #$00
LDA.W OAM_DispY,Y
SEC
SBC $06
BPL +
DEX
+;
xba
txa
xba
rep #$20

CLC
ADC $00 ; old Y - new Y + Ycamera + new Y = $09
STA $09
PHA
CLC
ADC.W #$0010
STA $09
SEC
SBC RAM_ScreenBndryYLo
CMP.W #$0100
PLA
STA $09
SEP #$20 ; Accum (8 bit)



BCC +
LDA.B #$F0
STA.W OAM_DispY,Y
+;
INY
INY
INY
INY
DEC $08
BPL -
LDX.W $15E9 ; X = Sprite index
RTS ; Return

; Return
Does anyone know if Super Mario World reads from $00-$0a after jumping to this routine? If it doesn't matter what's in $00-0a, I can optimize it further:

FinishOAMWrite:
PHB // Wrapper
PHK
PLB
JSR.W FinishOAMWriteRt
PLB
RTL // Return

FinishOAMWriteRt:
STY $0B
STA $08
LDY.W RAM_SprOAMIndex,X // Y = Index into sprite OAM
rep #$20
LDA RAM_SpriteYLo,X
SEC
SBC RAM_ScreenBndryYLo
STA $00

LDA RAM_SpriteXLo,X
sec
SBC RAM_ScreenBndryXLo
STA $02
sep #$20

-;
TYA
LSR
LSR
TAX
phx
LDA $0B
BPL +
LDA.W OAM_TileSize,X
AND.B #$02

+;
STA.W OAM_TileSize,X

LDX.B #$00
LDA.W OAM_DispX,Y
SEC
SBC $02
BPL +
DEX
+;
xba
txa
xba
rep #$21

ADC $02
CMP.W #$0100
SEP #$20 // Accum (8 bit)

plx

BCC +
LDA.W OAM_TileSize,X
ORA.B #$01
STA.W OAM_TileSize,X
+;
LDX.B #$00
LDA.W OAM_DispY,Y
SEC
SBC $00
BPL +
DEX
+;
xba
txa
xba
rep #$21
ADC $00
CLC
ADC.W #$0010
CMP.W #$0100

SEP #$20 // Accum (8 bit)
BCC +
LDA.B #$F0
STA.W OAM_DispY,Y
+;
INY
INY
INY
INY
DEC $08
BPL -
LDX.W $15E9 // X = Sprite index
RTS // Return



EDIT:
Are these the correct addresses?
define OAM_DispX($0300)
define OAM_DispY($0301)
define RAM_SpriteYLo($00d8)
define RAM_SpriteXLo($00e4)
define RAM_SpriteYHi($14d4)
define RAM_SpriteXHi($14e0)
define RAM_SprOAMIndex($15ea)
define RAM_ScreenBndryXLo($001a)
define RAM_ScreenBndryYLo($001c)
define OAM_TileSize($0420)
The first part of the code would have to be reverted back to 8-bit mode because the high and low parts of the object's X and Y coordinates are stored separately.

EDIT:

This version of the code works for sure. I'm using bass.exe. I heard that most people here use asar, which might be a little bit different.

Quote
arch snes.cpu

macro seek(n) {
origin (({n} & 0x7f0000) >> 1) | ({n} & 0x7fff)
base {n}
}

define OAM_DispX($0300)
define OAM_DispY($0301)
define RAM_SpriteYLo($d8)
define RAM_SpriteXLo($e4)
define RAM_SpriteYHi($14d4)
define RAM_SpriteXHi($14e0)
define RAM_SprOAMIndex($15ea)
define RAM_ScreenBndryXLo($1a)
define RAM_ScreenBndryYLo($1c)
define OAM_TileSize($0460)

seek($01b7bb)
jsl FinishOAMWriteRt
rts



seek($108000)

FinishOAMWriteRt:
sty $0b
sta $08
ldy {RAM_SprOAMIndex},x // Y = Index into sprite OAM


lda {RAM_SpriteYLo},x
sta $00
sec
sbc {RAM_ScreenBndryYLo}
sta $06
lda {RAM_SpriteYHi},x
sta $01


lda {RAM_SpriteXLo},x
sta $02
sec
sbc {RAM_ScreenBndryXLo}
sta $07

lda {RAM_SpriteXHi},x
sta $03


-;
tya
lsr
lsr
tax
phx
lda $0b
bpl +
lda.w {OAM_TileSize},x
and.b #$02

+;
sta.w {OAM_TileSize},x

ldx.b #$00
lda {OAM_DispX},y
sec
sbc $07
bpl +
dex
+;
xba
txa
xba
rep #$21

adc $02
sta $04
sec
sbc {RAM_ScreenBndryXLo}
cmp.w #$0100
sep #$20 // Accum (8 bit)

plx

bcc +
lda.w {OAM_TileSize},x
ora.b #$01
sta.w {OAM_TileSize},x
+;
ldx.b #$00
lda {OAM_DispY},y
sec
sbc $06
bpl +
dex
+;
xba
txa
xba
rep #$21
adc $00
sta $09
clc
adc.w #$0010
sec
sbc {RAM_ScreenBndryYLo}

cmp.w #$0100

sep #$20 // Accum (8 bit)
bcc +
lda.b #$f0
sta {OAM_DispY},y
+;
iny
iny
iny
iny
dec $08
bpl -
ldx.w $15e9 // X = Sprite index
rtl // Return
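
As a side note on the `seek` macro used in the listing above: its `origin` expression is the usual LoROM address-to-file-offset arithmetic. The Python sketch below mirrors that expression so the two `seek` targets can be checked by hand. The function name `lorom_to_file_offset` is hypothetical, and the result assumes a ROM file without a 512-byte copier header.

```python
def lorom_to_file_offset(snes_address):
    """Mirror the seek() macro: LoROM SNES address -> offset in the ROM file.

    Assumes a headerless ROM (no 512-byte copier header).
    """
    bank_part = (snes_address & 0x7F0000) >> 1   # each bank holds 32 KiB of ROM
    offset_part = snes_address & 0x7FFF          # position inside that 32 KiB
    return bank_part | offset_part


# The two seek() targets from the listing
print(hex(lorom_to_file_offset(0x01B7BB)))   # hijack location in bank $01
print(hex(lorom_to_file_offset(0x108000)))   # start of bank $10
```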
I've optimized it even further. If you're creating custom sprites, you can bypass this routine completely by writing the high X bit and size bit directly, which speeds it up even more.

Quote
arch snes.cpu

macro seek(n) {
origin (({n} & 0x7f0000) >> 1) | ({n} & 0x7fff)
base {n}
}

define OAM_DispX($0300)
define OAM_DispY($0301)
define RAM_SpriteYLo($d8)
define RAM_SpriteXLo($e4)
define RAM_SpriteYHi($14d4)
define RAM_SpriteXHi($14e0)
define RAM_SprOAMIndex($15ea)
define RAM_ScreenBndryXLo($1a)
define RAM_ScreenBndryXHi($1b)
define RAM_ScreenBndryYLo($1c)
define RAM_ScreenBndryYHi($1d)
define OAM_TileSize($0460)

seek($01b7bb)


FinishOAMWriteRt:
sty $0b
sta $08
ldy {RAM_SprOAMIndex},x // Y = Index into sprite OAM


lda {RAM_SpriteYLo},x
sec
sbc {RAM_ScreenBndryYLo}
sta $00
lda {RAM_SpriteYHi},x
sbc {RAM_ScreenBndryYHi}
sta $01


lda {RAM_SpriteXLo},x

sec
sbc {RAM_ScreenBndryXLo}
sta $02
lda {RAM_SpriteXHi},x
sbc {RAM_ScreenBndryXHi}
sta $03

tya
lsr
lsr
sta $07


-;




ldx.b #$00
lda {OAM_DispX},y
sec
sbc $02
bpl +
dex
+;
xba
txa
xba
rep #$21

adc $02
cmp.w #$0100
sep #$20 // Accum (8 bit)

ldx $07

lda #$00
bcc +

lda #$01

+;

sta $06

lda $0b
bpl +
lda.w {OAM_TileSize},x
and.b #$02
+;
ora $06
sta.w {OAM_TileSize},x

inx
stx $07


ldx.b #$00
lda {OAM_DispY},y
sec
sbc $00
bpl +
dex
+;
xba
txa
xba
rep #$21
adc $00
clc
adc.w #$0010
cmp.w #$0100

sep #$20 // Accum (8 bit)
bcc +
lda.b #$f0
sta {OAM_DispY},y
+;
iny
iny
iny
iny
dec $08
bpl -
ldx.w $15e9 // X = Sprite index
rts // Return
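
The main change in this version is that the byte in the `OAM_TileSize` table is built once per tile instead of being written and then patched. A small Python model of that merge is sketched below. It assumes, from the `AND.B #$02` and `ORA.B #$01` in the listings, that bit 1 is the size bit and bit 0 is the high X bit; the function name `tile_size_byte` and its parameters are hypothetical.

```python
def tile_size_byte(size_arg, existing, x_offscreen):
    """Model of the value stored to the size table for one tile.

    size_arg    -- the value the caller passed in Y (kept in $0B)
    existing    -- the byte already in the table for this tile
    x_offscreen -- result of the 16-bit "X >= $0100" comparison
    """
    if size_arg & 0x80:
        # Negative argument: keep only the size bit that is already there
        base = existing & 0x02
    else:
        # Otherwise the caller's value is used as the size directly
        base = size_arg
    return base | (0x01 if x_offscreen else 0x00)


print(tile_size_byte(0x02, 0x00, True))
```

A custom sprite that already knows each tile's size and whether it is off-screen could write this byte itself, which is the bypass described above.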



Quote
Edit: So after a bit of testing I can say there is a tiny bump of performance. I don't think this will matter for the average person but people pushing the limits on processing would benefit from this, like Vitor for example.


There's probably more slow code in the game than just this routine.
Do you know which sprites are good for testing slowdown? It seems like when I try to spam enemies, they stop spawning before enough are onscreen to cause slowdown anyway, and this is without the patch.
Are there any patches that allow you to have more than 12 objects, other than SA-1?
I was looking for other routines to speed up, and I found another graphics routine that eventually jumps to the code I've optimized, so I believe the two can be merged.

Then I looked it up in the disassembly and found out it's specifically for Banzai Bill. Do all enemies have separate drawing routines? I guess I can optimize this, but also generalize it for use with other sprites too.

Quote

CODE_02D5E4: 20 78 D3 JSR.W GetDrawInfo2
CODE_02D5E7: DA PHX
CODE_02D5E8: A2 0F LDX.B #$0F
CODE_02D5EA: A5 00 LDA $00
CODE_02D5EC: 18 CLC
CODE_02D5ED: 7D A4 D5 ADC.W DATA_02D5A4,X
CODE_02D5F0: 99 00 03 STA.W OAM_DispX,Y
CODE_02D5F3: A5 01 LDA $01
CODE_02D5F5: 18 CLC
CODE_02D5F6: 7D B4 D5 ADC.W DATA_02D5B4,X
CODE_02D5F9: 99 01 03 STA.W OAM_DispY,Y
CODE_02D5FC: BD C4 D5 LDA.W BanzaiBillTiles,X
CODE_02D5FF: 99 02 03 STA.W OAM_Tile,Y
CODE_02D602: BD D4 D5 LDA.W DATA_02D5D4,X
CODE_02D605: 99 03 03 STA.W OAM_Prop,Y
CODE_02D608: C8 INY
CODE_02D609: C8 INY
CODE_02D60A: C8 INY
CODE_02D60B: C8 INY
CODE_02D60C: CA DEX
CODE_02D60D: 10 DB BPL CODE_02D5EA
CODE_02D60F: FA PLX
CODE_02D610: A0 02 LDY.B #$02
CODE_02D612: A9 0F LDA.B #$0F
CODE_02D614: 4C A7 B7 JMP.W CODE_02B7A7
Originally posted by mario90
Originally posted by Drex
Are there any patches that allow you to have more than 12 objects, other than SA-1?


No there isn't. I'm pretty sure having more than 12 would be too taxing on the game, considering most of the time you can't even reach 12 sprites on screen without slowdown or other sprites not showing up. You could make it possible by doing what SA-1 does and move and expand the tables, but that seems like it'll be a ton of work for little gain for the reason I just stated.


Well, that's the reason why I'm doing optimizations. So it can have more sprites with less slowdown.
Yoshifanatic, how are you changing RAM addresses, and how are you reassembling code without corrupting the graphics and crossing bank boundaries?
Originally posted by MelodicCodes
It's just become a knee-jerk reaction of mine to bring that up every time I hear a discussion about graphics in game design. So many people think that graphics are instantly the focus of a game's development while forgetting aesthetics almost entirely.

As for why we haven't seen patches like these, one has to keep in mind that graphics are simply a tool for the aesthetics. For most people, smoke trails and flashy stomping effects aren't really necessary for the feeling and aesthetic they're trying to portray.

Simply put, why include something that wastes precious CPU cycles and sprite memory, when it doesn't contribute much to your aesthetic or gameplay?
Having said that, I can see why someone may want to use things like this. If you want some fancy sprite particle effects, I direct you to Ladida's Cluster effects. They're pretty simple to use and can greatly enhance the graphics of your hack, but may induce slowdown. Or, if you like tinkering with ASM, there's always VWF dialogues, but I totally understand if one doesn't want to mess with them.

Graphical fidelity is increased dramatically by HDMA, speaking of ASM tinkering. It can be unintuitive to use at first, but HDMA can make a hack look several times better with little to no processing cost or slowdown.

As for your examples of more stomping effects and/or bullet trails, I got nothing.


Seriously, a couple of little smoke trail clouds aren't going to bog the CPU down much.
Did you take advantage of tiles that are completely black or completely white?
It can save DMA time, so you no longer need letterboxing.
It will also save CPU decompression time because it wouldn't have to decompress blank tiles.

I made a Bad Apple SNES demo last year that ran at 30fps, and one of the things I did was only decompress tiles that have details. The sound quality wasn't as good as in your demo though, so I don't know how much you're limited by ROM space.
Today, I was having fun optimizing the sprite rotation algorithm I made a while ago for my homebrew game. It runs in the game's leftover CPU time (whenever it is not calculating game logic) and automatically fills the RAM with as many frames as possible (it's a compromise between smoothness, setup time, and the number of unique sprites), so when you approach a certain part of the level, the rotation is already calculated.

I got it just fast enough to do a 32x32 sprite per frame, if there is nothing happening onscreen. It doesn't look that fast on paper, but for a rotating 32x32 sprite with 64 angles, it only needs 32 frames because of horizontal/vertical flipping, which takes up only half a second.

I'm not that good with making notes, but I used tricks such as expecting positive numbers to ALWAYS leave the carry bit clear, and negative numbers to ALWAYS leave the carry bit set. I also used lookup tables to convert pixels to planar format.
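
The frame count above follows from the fact that rotating a sprite by 180 degrees gives the same image as flipping it both horizontally and vertically, so only half of the angles need to be rendered. The Python sketch below checks that identity on a tiny bitmap and redoes the arithmetic. All names are hypothetical, and the 60 frames per second figure is an assumption (NTSC timing, one sprite rendered per frame as described).

```python
def rotate_180(bitmap):
    """Rotate a square bitmap (a list of rows) by 180 degrees, pixel by pixel."""
    size = len(bitmap)
    return [[bitmap[size - 1 - y][size - 1 - x] for x in range(size)]
            for y in range(size)]


def flip_h(bitmap):
    """Mirror each row left to right (the OAM horizontal flip)."""
    return [row[::-1] for row in bitmap]


def flip_v(bitmap):
    """Mirror the rows top to bottom (the OAM vertical flip)."""
    return bitmap[::-1]


def frames_to_render(angle_count):
    """Only half a turn has to be drawn; the other half comes from flipping."""
    return angle_count // 2


def seconds_to_render(angle_count, frames_per_second=60):
    """Time needed when one rotated sprite is produced per frame (assumed 60 fps)."""
    return frames_to_render(angle_count) / frames_per_second


sample = [[1, 2], [3, 4]]
print(rotate_180(sample) == flip_h(flip_v(sample)))
print(frames_to_render(64), seconds_to_render(64))
```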

Code
rotate_sprites_for_modular_animation:
-;
lda $0000,y
beq +
tax
phy
jsr rotate_sprite
ply
iny #2
lda.l {terminate_rotation}
beq -
+;
stz {modular_animation_data}
stz {terminate_rotation}
rts

rotate_sprite:
phb
php
sep #$20

lda #$80
sta {scratch_pad_ram}+34      //the LUTs are in bank $80
sta {scratch_pad_ram}+38      //These are pointers for the
sta {scratch_pad_ram}+42      //LUTs
sta {scratch_pad_ram}+46



lda $0004,x                   //$0000,x is ROM address
sta {rotation_step}           //$0002,x is ROM bank
stz {rotation_angle}          //$0004,x is rotation step amount
lda $000a,x                   //$0006,x is RAM address
asl #3                        //$0008,x is RAM bank
sta {size}                    //$000a,x is size 2=16x16, 4=32x32
asl #2
sta {d}
lda $0000,x
stz {x_pixel}
sta {x_pixel_hi}
lda $0001,x
stz {y_pixel}
sta {y_pixel_hi}
lda $0002,x
sta {scratch_pad_ram}+3       //these are the banks of the
sta {scratch_pad_ram}+7       //"pixel pointers"
sta {scratch_pad_ram}+11
sta {scratch_pad_ram}+15
sta {scratch_pad_ram}+19
sta {scratch_pad_ram}+23
sta {scratch_pad_ram}+27
sta {scratch_pad_ram}+31

lda $0008,x
pha
rep #$20


lda $0006,x
tax

plb                         //make data bank hold the RAM bank
phd                         //and X hold the destination address
lda #$0000
tcd
jsr convert_bitmap
pld
plp
plb
rts

new_rotation_step:             //I forgot how I did this math stuff
                               //but it's kind of like setting up
pla                            //mode 7 registers
sta.b {y_pixel}
pla
sta.b {x_pixel}

lda.l {terminate_rotation}
beq +
rts
+;

lda.b {rotation_step}
clc
adc.b {rotation_angle}
sta.b {rotation_angle}
cmp #$0080
bcc convert_bitmap
rts


convert_bitmap:
sep #$20
lda #$00
sta $004200
rep #$20

phx
lda.b {rotation_angle}
asl
and #$01fe
tax
lda $000000+sine,x
sta.b {sine}
lda $000000+cosine,x
sta.b {cosine}
plx

lda.b {x_pixel}
pha
lda.b {y_pixel}
pha

lda.b {sine}
clc
adc.b {cosine}
sta.b {a}
sep #$20
sta $00211b
xba
sta $00211b
lda.b {size}
lsr
sta $00211c
rep #$20
lda.b {size}
xba
clc
adc.b {a}
lsr
sec
sbc $002134
clc
adc.b {x_pixel}
sta.b {x_pixel}
lda.b {cosine}
sec
sbc.b {sine}
sta.b {a}
sep #$20
sta $00211b
xba
sta $00211b
lda.b {size}
lsr
sta $00211c
rep #$20
lda.b {size}
xba
clc
adc.b {a}
lsr

sec
sbc $002134
clc
adc.b {y_pixel}
sta.b {y_pixel}
lda.b {size}
sta.b {c}
lsr #3
sta.b {b}
lda.b {size}
asl #2
sta.b {a}

sep #$20
lda #$81
sta $004200
rep #$20

lda.b {cosine}          //adjust sine and cosine so clc and sec
bpl +                   //are not needed
dec
sta.b {cosine}
+;

lda.b {sine}
bpl +
dec
sta.b {sine}
+;

convert_bitmap_loop:
lda.b {c}
bne old_rotation_step
jmp new_rotation_step

old_rotation_step:
lda.b {x_pixel}
pha
lda.b {y_pixel}
pha
convert_line:


jmp convert_pixel

convert_pixel_done:

txa
clc
adc #$0020
tax
lda.b {b}
bne convert_pixel
pla
clc
adc.b {cosine}
adc #$0000
sta.b {y_pixel}
pla
clc
adc.b {sine}
adc #$0000
sta.b {x_pixel}
dec.b {c}
lda.b {size}
lsr #3

sta.b {b}
lda.b {size}
txa
sec
sbc.b {a}
inc #2
tax
bit #$000e
bne convert_bitmap_loop
clc
adc.b {d}
sec
sbc #$0010
tax
jmp convert_bitmap_loop


convert_pixel:                    //This is the most important part
                                  //of this code.
dec.b {b}
lda.b {y_pixel}                   //This is where pixels get drawn.
sta.b {scratch_pad_ram}+1
sec
sbc.b {sine}
sbc #$0000
sta.b {scratch_pad_ram}+5         //First it calculates the Y position
sbc.b {sine}                      //of every pixel,
sta.b {scratch_pad_ram}+9
sbc.b {sine}
sta.b {scratch_pad_ram}+13
sbc.b {sine}
sta.b {scratch_pad_ram}+17
sbc.b {sine}
sta.b {scratch_pad_ram}+21
sbc.b {sine}
sta.b {scratch_pad_ram}+25
sbc.b {sine}
sta.b {scratch_pad_ram}+29
sbc.b {sine}
sta.b {y_pixel}

lda.b {x_pixel}                   //Then it calculates the X position
sta.b {scratch_pad_ram}           //of every pixel.
clc
adc.b {cosine}
adc #$0000
sta.b {scratch_pad_ram}+4         //The top byte of the X position
adc.b {cosine}                    //overwrites the low byte of the
sta.b {scratch_pad_ram}+8         //Y position, creating the ROM
adc.b {cosine}                    //address of the pixel, in the
sta.b {scratch_pad_ram}+12        //format: bbbbbbbbyyyyyyyyxxxxxxxx
adc.b {cosine}                    //where b is the bank, and x and y
sta.b {scratch_pad_ram}+16        //are coordinates in a 256x256
adc.b {cosine}                    //bitmap image in the bank that
sta.b {scratch_pad_ram}+20        //contains the rotatable sprites.
adc.b {cosine}
sta.b {scratch_pad_ram}+24
adc.b {cosine}
sta.b {scratch_pad_ram}+28
adc.b {cosine}
sta.b {x_pixel}

lda [{scratch_pad_ram}+1]         //now it calculates the offsets of
asl #4                            //the planar look up tables
ora [{scratch_pad_ram}+17]
and #$00ff
asl
sta.b {scratch_pad_ram}+32
lda [{scratch_pad_ram}+5]
asl #4
ora [{scratch_pad_ram}+21]
and #$00ff
asl
sta.b {scratch_pad_ram}+36
lda [{scratch_pad_ram}+9]
asl #4
ora [{scratch_pad_ram}+25]
and #$00ff
asl
sta.b {scratch_pad_ram}+40
lda [{scratch_pad_ram}+13]
asl #4
ora [{scratch_pad_ram}+29]
and #$00ff
asl
sta.b {scratch_pad_ram}+44


ldy #packed_to_planar_lo
lda [{scratch_pad_ram}+32],y       //now it packs together bitplanes
asl                                //0 and 1
ora [{scratch_pad_ram}+36],y
asl
ora [{scratch_pad_ram}+40],y
asl
ora [{scratch_pad_ram}+44],y
sta $0000,x

ldy #packed_to_planar_hi           //now it packs together bitplanes
lda [{scratch_pad_ram}+32],y       //2 and 3
asl
ora [{scratch_pad_ram}+36],y
asl
ora [{scratch_pad_ram}+40],y
asl
ora [{scratch_pad_ram}+44],y
sta $0010,x

jmp convert_pixel_done

packed_to_planar_lo:
dw $0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101	//DCBAdcba > ---B---b---A---a
dw $0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111
dw $1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101
dw $1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111
dw $0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101
dw $0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111
dw $1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101
dw $1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111
dw $0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101
dw $0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111
dw $1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101
dw $1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111
dw $0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101,$0000,$0001,$0100,$0101
dw $0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111,$0010,$0011,$0110,$0111
dw $1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101,$1000,$1001,$1100,$1101
dw $1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111,$1010,$1011,$1110,$1111

packed_to_planar_hi:
dw $0000,$0000,$0000,$0000,$0001,$0001,$0001,$0001,$0100,$0100,$0100,$0100,$0101,$0101,$0101,$0101	//DCBAdcba > ---D---d---C---c
dw $0000,$0000,$0000,$0000,$0001,$0001,$0001,$0001,$0100,$0100,$0100,$0100,$0101,$0101,$0101,$0101
dw $0000,$0000,$0000,$0000,$0001,$0001,$0001,$0001,$0100,$0100,$0100,$0100,$0101,$0101,$0101,$0101
dw $0000,$0000,$0000,$0000,$0001,$0001,$0001,$0001,$0100,$0100,$0100,$0100,$0101,$0101,$0101,$0101
dw $0010,$0010,$0010,$0010,$0011,$0011,$0011,$0011,$0110,$0110,$0110,$0110,$0111,$0111,$0111,$0111
dw $0010,$0010,$0010,$0010,$0011,$0011,$0011,$0011,$0110,$0110,$0110,$0110,$0111,$0111,$0111,$0111
dw $0010,$0010,$0010,$0010,$0011,$0011,$0011,$0011,$0110,$0110,$0110,$0110,$0111,$0111,$0111,$0111
dw $0010,$0010,$0010,$0010,$0011,$0011,$0011,$0011,$0110,$0110,$0110,$0110,$0111,$0111,$0111,$0111
dw $1000,$1000,$1000,$1000,$1001,$1001,$1001,$1001,$1100,$1100,$1100,$1100,$1101,$1101,$1101,$1101
dw $1000,$1000,$1000,$1000,$1001,$1001,$1001,$1001,$1100,$1100,$1100,$1100,$1101,$1101,$1101,$1101
dw $1000,$1000,$1000,$1000,$1001,$1001,$1001,$1001,$1100,$1100,$1100,$1100,$1101,$1101,$1101,$1101
dw $1000,$1000,$1000,$1000,$1001,$1001,$1001,$1001,$1100,$1100,$1100,$1100,$1101,$1101,$1101,$1101
dw $1010,$1010,$1010,$1010,$1011,$1011,$1011,$1011,$1110,$1110,$1110,$1110,$1111,$1111,$1111,$1111
dw $1010,$1010,$1010,$1010,$1011,$1011,$1011,$1011,$1110,$1110,$1110,$1110,$1111,$1111,$1111,$1111
dw $1010,$1010,$1010,$1010,$1011,$1011,$1011,$1011,$1110,$1110,$1110,$1110,$1111,$1111,$1111,$1111
dw $1010,$1010,$1010,$1010,$1011,$1011,$1011,$1011,$1110,$1110,$1110,$1110,$1111,$1111,$1111,$1111
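
The two tables above do not have to be typed by hand. Going by their comments (`DCBAdcba > ---B---b---A---a` and `DCBAdcba > ---D---d---C---c`), each index packs two 4-bit pixels and each entry spreads two of their bits across a 16-bit word. The Python sketch below regenerates the entries and models how `convert_pixel` shifts and ORs four of them into one row of a bitplane pair. The names `planar_word`, `build_table` and `pack_row` are hypothetical, and the sketch assumes every source pixel byte is in the range 0 to 15.

```python
def planar_word(index, plane_pair):
    """Table entry for index DCBAdcba (high nibble = one pixel, low nibble = another).

    plane_pair 0 gives the 'lo' table (bitplanes 0 and 1),
    plane_pair 1 gives the 'hi' table (bitplanes 2 and 3).
    """
    low_pixel = index & 0x0F
    high_pixel = index >> 4
    bit = 2 * plane_pair
    word = (low_pixel >> bit) & 1                  # bit 0
    word |= ((high_pixel >> bit) & 1) << 4         # bit 4
    word |= ((low_pixel >> (bit + 1)) & 1) << 8    # bit 8
    word |= ((high_pixel >> (bit + 1)) & 1) << 12  # bit 12
    return word


def build_table(plane_pair):
    """All 256 entries of one lookup table."""
    return [planar_word(index, plane_pair) for index in range(256)]


def pack_row(pixels, table):
    """Pack 8 pixels (left to right) into one 16-bit word for a bitplane pair.

    Mirrors the lda / asl / ora chain: pixel k is paired with pixel k + 4.
    The low byte is the lower plane of the pair, the high byte the upper one.
    """
    word = 0
    for k in range(4):
        index = ((pixels[k] << 4) | pixels[k + 4]) & 0xFF
        word = (word << 1) | table[index]
    return word & 0xFFFF


lo_table = build_table(0)
print([hex(entry) for entry in lo_table[:4]])
print(hex(pack_row([1, 0, 0, 0, 0, 0, 0, 0], lo_table)))
```

Generating the tables from a script like this at build time would keep the source shorter, at the cost of an extra build step.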
Maybe if you turn rendering off while writing tiles it would work.

lda #$80
sta $2100

**insert tile drawing code here**

lda #$0f
sta $2100
Here is a demo of my game that uses the rotation code. If you want to see how fast it can rotate, go to level 2, because there is a pirate with a rotating cannon right at the start of the unfinished level. To get to level 2, push start while in level 1.

I also had to do a lot of tricks getting a lot of fluidly animated characters onscreen at once. One trick was predicting when it needs to DMA more than 4kB at once, and delaying the animation of a character by a frame to prevent screen tearing glitches (I actually learned this trick from reading DKC source code). Another trick I use is checking for duplicated sprites.

http://bin.smwcentral.net/u/28835/Alisha%2527s%2BAdventure.zip
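
The DMA trick mentioned above (predicting when a frame would need more than 4kB and delaying one character's animation) can be pictured as a small per-frame budget check. The Python sketch below is only an illustration of that idea, not the game's actual logic: the greedy first-fit policy, the function name `plan_frame_uploads` and the request format are all assumptions.

```python
DMA_BUDGET_BYTES = 4 * 1024   # the "4kB at once" figure from the post


def plan_frame_uploads(requests, budget=DMA_BUDGET_BYTES):
    """Split (name, size_in_bytes) requests into uploads for this frame
    and requests that are delayed to a later frame."""
    upload_now = []
    deferred = []
    used = 0
    for name, size in requests:
        if used + size <= budget:
            upload_now.append(name)
            used += size
        else:
            # Over budget: this character keeps its old graphics for one more frame
            deferred.append(name)
    return upload_now, deferred


now, later = plan_frame_uploads([("hero", 2048), ("pirate", 2048), ("cannon", 512)])
print(now, later)
```

Deferred requests would simply be offered again on the next frame, so a delayed character's animation lags by a frame instead of the screen tearing.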