Just to reply to you, Tom, about the purpose of this topic: from my point of view it's just to explain why the 6502 (and its derivatives) is a *bad* CPU :p It may sound a bit aggressive, and sorry for that, but I often read that the 6502 is a good and powerful CPU because its instructions execute in few cycles compared to the 68000, for instance. For me that is total nonsense and I want to clarify why :)
So here is my long reply to your last big post:
Quote:
Stef: I would have posted much more, but this isn't a CPU VS thread. I tried to keep it minimal. Sorry for the late reply (school and such).
OK, now we have a dedicated topic, so you can take your time :)
Quote:
Actually, I did miss count the cycles for GBCPU (got the cycle and byte listing backwards).
...
It's 13 cycles (instead of the 10 I posted), or 23 cycles in 6502@1.79mhz cycle times. And I didn't think that worked against it originally; I thought it showed that they were more comparable (though it looks a little worse now)
Indeed it's even worse: the original figure was 13, and since everything is rounded up to a multiple of 4 cycles that gives 16 cycles, so ld a,(addr) is 4 cycles...
Quote:
Actually, it's a common operation. It was a simple and common example, that allowed me to extend into something more complex, but still common (I mean, array access is a common thing). It was to show a glaring weakness of the processor; It's not that everything needs to pass through the A register for operations, but that those logical operations supporting the A register - don't have direct memory addressing modes (and to a more important extent, direct memory+indexing). Just look at add, compare, sub; only three addressing modes on the GB-CPU for A. Even loading/storing A register (which gets the most use, since everything passes through it), only has three addressing modes. The original 6502 has 8 addressing modes for instructions that effect the Acc reg. Those additional addressing modes can translate into not having to load another register just to complete an operation. That's an immediate savings, but it also plays back into the larger design (and overhead). The GBz80 is weak in addressing modes overall (other instructions) by comparison.
Again, that is a biased comparison to me: you definitely can't compare CPUs on that type of operation, or more precisely on an isolated operation of that type. With a 6502 you tend to do it *a lot* because you don't have the choice: you work with memory and only memory. With other CPUs you do it less, far less actually... In my 68k code I very rarely use that kind of code, because on these CPUs it's indeed quite inefficient to do that (as on the GB CPU). Honestly I think you are terribly biased by 65x habits :p And you just proved it with the A register thing: you said the problem is the lack of addressing modes for the A operations, but that is your point of view because you are used to having more addressing modes on 65x CPUs. Actually, on the classic Z80 you have access to more indexed modes (with the IX and IY registers) but it does not help at all; I never use them.
It's just part of the problem; for this processor the real biggest issue is that you can't do inter-register operations, so the advantage of having many registers (compared to other 8-bit CPUs) is somewhat wasted. And I don't think the GB CPU is that weak in addressing modes: the simple but efficient (HL) indexing is almost enough for everything. It would be better to have two full 16-bit index registers, but you can use (C) indexing into the last page for that, and in this case I honestly think the indexing can be better than on 65x CPUs... You also have LD HL, SP+x, which is very useful for local variable accesses, or LDH A, (a8) for fast last-page access (the equivalent of 6502 ZP access).
Quote:
Yeah, I saw those. At first glance they look useful, but then you see that addressing mode (GB zero page) isn't used for logical operations. They only save one byte or cycle, here or there. LDH would have been useful if it could load to other registers (again, effective address mode problem). LD (C) is interesting, but it's really just indexing into GB zero page: ld ($ff00+c) = lda zp,x. Not indirection as it might appear and would be more useful; lda [zp,x].
Indirection?? Indirection is a weird feature of the 6502 that exists because it does not have plain 16-bit index registers. It's totally useless on a Z80-based CPU, as it is on modern CPUs, and I believe that even on the 6502 you try to avoid using it as it's very slow. On a classic CPU you just load the address register, then use it to access the value in memory...
Quote:
Indexed? Really? Sure, it's been years since I've coded on the processor - but just looking over it now, I see absolutely nothing that gives it the advantage of indexing. Matter of fact, it has no indexing what-so-ever in this context (SP+8bit offset is technically indexing, but that's hardly a qualifier).
But LD A,(HL) *is* indexing! Of course it is just direct indexing without any offset (absolute or relative), but still it is indexing. The idea of indexing is "iterating over a table", and with the new HL post-increment / post-decrement the GB CPU is even faster than the classic Z80.
You can now do a fast memory clear operation with LD (HL+), A:
2 machine cycles (or 8 clock cycles) per byte cleared... you won't be able to go as fast with the 6502.
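To spell out the two cycle figures quoted for LD (HL+), A: GB timings are usually given either in machine cycles or in clock cycles, with one machine cycle being 4 clock cycles. A minimal Python sketch of the conversion (the function and constant names are made up for illustration, and the loop overhead of a real clear routine is ignored):

```python
# Assumption: GB CPU timings are quoted either in machine cycles (M-cycles)
# or in clock cycles, with 1 M-cycle = 4 clock cycles.
CLOCKS_PER_M_CYCLE = 4

def m_cycles_to_clocks(m_cycles):
    """Convert GB machine cycles to clock cycles."""
    return m_cycles * CLOCKS_PER_M_CYCLE

def unrolled_clear_cost(num_bytes, m_cycles_per_store=2):
    """M-cycles for a fully unrolled run of LD (HL+),A stores (no loop overhead)."""
    return num_bytes * m_cycles_per_store

print(m_cycles_to_clocks(2))      # LD (HL+),A: 2 M-cycles expressed in clock cycles
print(unrolled_clear_cost(256))   # M-cycles to clear 256 bytes, fully unrolled
```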
Quote:
That's because you don't know the processor. I never said it wasn't convoluted or complex (how ever that translates to you), but it most definitely has its strengths (that are really underestimated). You're a 68k guy, so you see everything thought that lens. Until you've cut your teeth on the 65x, you can't accurately speak on its strengths and weaknesses. I'm proficient at 65x, not because I love the processor or ISA, but because I understand where the strengths lay. I coded for many other processors, including the 68k. The 68k is probably one of my favorite ISAs, simply because it's dead easy to code for (and get good results). Matter of fact, almost all my macros for 65x assembly - resemble 68k instructions (including labeling ZP areas with 68k register names: A0-A7, D0-D7, etc). But the original 68k isn't the end all, be all. It has its deficiencies, where the 8bit processors can creep up on it (clock for clock). And by 8bit processors, I mean specifically mean 65x, 6809, and 6309.
Honestly, maybe I don't know this CPU well enough, but for sure you really don't know other CPUs well enough, or you are just too biased by your love for this CPU :p I started assembly programming 20 years ago with a... 6502 (on a VIC-20) :p OK, I did not do much with this CPU, but still I think I have good experience with CPU assembly in general: I did some 6809, 6800, Z80, Saturn CPU (not the Sega Saturn CPU but a very weird and neat 4-bit CPU), ARM, SHx, 680x0, x86, microchips and others I can't remember. I believe I have a good overview of CPU design in general. Also, I never said the 68000 is perfect; it could be improved in some ways (with some extra instructions, for instance), but still I think it has a very elegant and efficient design with a very powerful and balanced ISA, which is definitely not the case for the 6502. Again, you can't compare CPUs clock for clock (and honestly I'm very surprised you make that mistake); that is definitely not interesting... By that logic, is a Pentium 4 simply weaker than a Pentium 3?? Of course not: you have to take other aspects into account, such as memory usage efficiency.
Quote:
I think the code below speaks for itself:
Code:
;24bit add for X and Y. 16:8 = 16bit for whole, 8bit for fractional
ldx #$xx ;2
jsr AddVelocity ;6
AddVelocity:
lda x_float,x ;4
clc ;2
adc x_float_inc,x ;4
sta x_float,x ;4
lda x_whole.l,x ;4
adc x_whole_inc,x ;4
sta x_whole.l,x ;4
lda x_whole.h,x ;4
adc x_whole_inc.h,x ;4
sta x_whole.h,x ;4 = 38
lda y_float,x ;4
clc ;2
adc y_float_inc,x ;4
sta y_float,x ;4
lda y_whole.l,x ;4
adc y_whole_inc,x ;4
sta y_whole.l,x ;4
lda y_whole.h,x ;4
adc y_whole_inc.h,x ;4
sta y_whole.h,x ;4 = 38
rts ;6
;38+38+6+6+2 = 90 cycles
That's a simple array of 256 objects, all 24bit in length: 16bit whole part (signed) and 8bit fractional part. Hell, that's not even really optimized (only optimized for split tables). I added "ldx #$xx" for simplicity overhead, but it would be much more flexible than that (an index into an array that could skip non entries, which translates into not having to move any data around or resorting an array). This was taken from the 68k vs 65x discussion with Steve Snake. Cycle for cycle, this is even faster than his fully optimized 68k example (32bit adds with auto incrementing). Does my simple 'add' and add with indexing seem more clear now?
OK, an array of 256 to start with; if you want more, you have to add code... Anyway, it's just normal that "cycle for cycle" it is faster on the 6502. It *should be faster* on the 6502, as this CPU generally works at 1/4 of the clock speed of a 68000! That's what I'm trying to explain to you :) This CPU cannot run at a high clock speed simply because it accesses the bus on every internal cycle, whereas the 68000 accesses the bus every 4 cycles. That is a *huge* difference: if we count in bus cycles, the Genesis 68000 actually runs at... 1.92 MHz (7.67 MHz / 4)! That would be a fairer "clock to clock" comparison, and in that case the 65x CPUs are just so bad that it is ridiculous. But still, even comparing against the code you posted, I accept the challenge :) Here it is:
Code:
lea AddVelocity, a6 ; 8 (using xxxx(pc))
lea Ret, a5 ; 8 (using xxxx(pc))
...
lea xxxx(a3), a0 ; 8
jmp (a6) ; 8
Ret:
...
AddVelocity:
move.l a0, a1 ; 4
move.l (a0)+, d0 ; 12
move.l (a0)+, d1 ; 12
add.l (a0)+, d0 ; 14
add.l (a0)+, d1 ; 14
move.l d0, (a1)+ ; 12
move.l d1, (a1)+ ; 12
jmp (a5) ; 8
88 cycles for the AddVelocity routine.
16 cycles for the pointer setup and call; you also have 16 cycles for base initialization, but over multiple calls you pay that only once. OK, if we compare cycle by cycle I have 104 cycles where you have 90 cycles. I used registers for fast call / return, as the 68k is slow for the call / return sequence, but in a real situation you would use that type of optimization when you call a small routine like this one a billion times. So of course in terms of number of cycles you have the edge, but again a 65x CPU is not meant to run at the same speed as a 68000, and even in this particular case you can see you don't have a big advantage in number of cycles compared to the 68000.
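To put the two cycle counts on the same footing, here is a rough Python sketch that converts them to wall-clock time. The cycle counts are the ones from the two listings; the clock speeds are assumptions for illustration only (7.67 MHz for the Genesis 68000, and a quarter of that for the 65x, following the "1/4 of the clock speed" argument), so swap in whatever clocks you think are fair:

```python
# Cycle counts as tallied in the two listings.
M68K_ROUTINE = [4, 12, 12, 14, 14, 12, 12, 8]   # AddVelocity body, incl. jmp (a5)
M68K_CALL = [8, 8]                              # lea xxxx(a3),a0 + jmp (a6)
CYCLES_65X = 90                                 # total quoted for the 65x version

def time_us(cycles, clock_mhz):
    """Execution time in microseconds for a cycle count at a given clock."""
    return cycles / clock_mhz

cycles_68k = sum(M68K_ROUTINE) + sum(M68K_CALL)

# Assumed clocks (illustration only, not measured values).
CLOCK_68K_MHZ = 7.67
CLOCK_65X_MHZ = CLOCK_68K_MHZ / 4

print(cycles_68k)                                     # 68k cycles: routine + call
print(round(time_us(cycles_68k, CLOCK_68K_MHZ), 1))   # microseconds, 68000 version
print(round(time_us(CYCLES_65X, CLOCK_65X_MHZ), 1))   # microseconds, 65x version
```

With these assumed clocks the 65x version needs fewer cycles but more time per call; the gap shrinks or grows depending on the clock you plug in for the 65x.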
Now let's see how you would normally do it on a 68000, assuming you need to update the position of many objects:
Code:
lea xxxx(a3), a0 ; 8
move.w #xx, d7 ; 8
.loop:
move.l a0, a1 ; 4
move.l (a0)+, d0 ; 12
move.l (a0)+, d1 ; 12
add.l (a0)+, d0 ; 14
add.l (a0)+, d1 ; 14
move.l d0, (a1)+ ; 12
move.l d1, (a1)+ ; 12
dbra d7, .loop ; 10
So now we have 90 cycles per object processed, and I could probably optimize it further with a movem sequence, but I don't care too much here. With the 65x you will run into problems if the array crosses a 256-byte boundary, so you have to prepare / split the arrays (and so spend some extra cycles here and there).
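Both versions compute the same thing, a fixed-point "position += velocity": the 65x code does it one byte at a time with the carry chained through adc, while the 68000 does it in a single add.l. A minimal Python sketch of that equivalence (the function names are made up for the example, and it models 24-bit values like the 16:8 format in the quoted code):

```python
def add_bytewise(pos, inc, num_bytes=3):
    """Add two values one byte at a time, low byte first, chaining the carry
    the way a clc / adc / adc / adc sequence does."""
    carry = 0
    result = 0
    for i in range(num_bytes):
        shift = 8 * i
        total = ((pos >> shift) & 0xFF) + ((inc >> shift) & 0xFF) + carry
        result |= (total & 0xFF) << shift
        carry = total >> 8          # carry out of this byte into the next one
    return result                   # the final carry is dropped (value wraps)

def add_wide(pos, inc, num_bytes=3):
    """Same addition done in one step, as a single wide add would do it."""
    mask = (1 << (8 * num_bytes)) - 1
    return (pos + inc) & mask

# 16:8 fixed point: 0x0012.80 + 0x0000.C0 carries from the fraction
# into the whole part.
assert add_bytewise(0x001280, 0x0000C0) == add_wide(0x001280, 0x0000C0) == 0x001340
```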
Quote:
Because they went with a stock 65816. The stock chip was supposed to be drop in replacement/upgrade to the original 6502. That means in order to access a larger address bus, they had to multiplex the upper 8bits on the data bus (intel had been doing this before). That's one (main) reason for faster rom requirements. They could have cut that nearly in half, had they went with a custom 65816 package that eliminated that part (there's no need for it on the SNES). Later snes carts has custom 65816's on them (SA-1) at speeds of ~10mhz.
The fast ROM is needed because of the fast bus cycle on 65x CPUs, and for this reason only... I'm not even talking about having a real 16-bit bus on the SNES; it would have raised costs and required modifications to the standard 65816 package.
SA-1 came *very late* (1995) and was very expensive! I remember all games including the SA-1 chip were quite expensive on SNES. And still, the SA-1 includes a tiny 2 KB of fast RAM (to feed the fast 65816), directly integrated into the die... nothing comparable to what the SNES could have had when it was released.
Quote:
The C64 has local mapped video memory. The 6809 is just like the 65x, memory bus hog, and the CoCo systems had local mapped video memory too. I believe it's the same for Apple II computers as well (65x based).
Yeah, and because of that the CPU was running at only 1 MHz or even less!
Quote:
I was able to write equivalent code on the 65816 (fast rom: 3.1mhz) that directly compared to the 68k, and even at the slower clock speed the difference wasn't half - it was much closer to 70% mark. If you took the time to look at what cycles per byte on the bus, then you'd see that you would get a near 50% increase in speed (both ZP and rom access). The '816/65x does have overlap; it can fetch a byte from the bus while processing an operation. The current snes cpu at the current fast rom speed with a 16bit data bus, even with wait states on wram, would easily be on par with the 7.67mhz 68k. Even Steve Snake agreed with me that even with the 8bit data bus, simply running the 65816 at the speed of 7.67mhz would smoke the 68k at the same speed (in the context of these consoles).
Again, you are trying to compare these CPUs at the same speed, which is definitely not fair...
Even at 3.58 MHz the 65816 already requires faster memory than the Genesis 68000; that's why fast ROMs were not available immediately (too expensive).
But OK, for me at ~3.1 MHz the 65816 delivers only about 50% of the performance of the 7.7 MHz 68000, maybe a bit more, but definitely not 70%. And with a 16-bit bus you would only make it a bit faster, not that much! Otherwise you would need to heavily modify the CPU architecture so that it accesses the bus only half of the time (every other cycle) and has a 2-byte queue, so you could increase its speed greatly by just clocking it at double the clock rate :) But unfortunately the CPU is based on the 65x architecture, so that was not a possible solution.
Of course you can always find a piece of code that suits the 65816 better, but generally the 68000 will be faster, and in the worst case the 68000 can be *a lot* faster (multiplication / division, 32-bit operations, memory set/copy...)! I'm ready to compare whatever piece of code you want to throw at me :p And I don't compare in cycles but in CPU efficiency; accepting fast ROM speed is already "kind", as the 68000 uses slower memory to work at 7.67 MHz :p
Quote:
Yeah? So do RISC processors. The original 68k has both powerful instructions and addressing modes, but it also has long execution times on average; it's got some fat with that muscle. The 65x series is lean and fast.
The 68000 was designed to work that way and it was efficient; again, everything comes down to the whole architecture. When you have 2 MHz memory you have the choice between an 8 MHz 68000 or a 2 MHz 65816; I think the choice is quite easy here! The 68000 will give you *a lot* more.
Then you have costs, and that is the only big advantage of the 65x CPUs: they were very cheap... so you could eventually use faster memory (as the PCE does), but that does not mean the 6502 is better for that reason. The CPU itself is crap, but its low price helps to somewhat compensate for it.
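The memory-speed argument can be put in numbers. A rough Python sketch, assuming (as argued in this post) that the 68000 needs 4 clock cycles per bus access and a 65x CPU needs 1; the names and helper functions are mine, and wait states or video contention are ignored:

```python
# Assumed clock cycles per bus access (taken from the argument in this post).
CYCLES_PER_BUS_ACCESS = {"68000": 4, "65x": 1}

def bus_rate_mhz(cpu, clock_mhz):
    """Bus access rate (in MHz) that the memory must sustain at a given CPU clock."""
    return clock_mhz / CYCLES_PER_BUS_ACCESS[cpu]

def max_clock_mhz(cpu, memory_mhz):
    """Highest CPU clock that a memory of the given speed can feed."""
    return memory_mhz * CYCLES_PER_BUS_ACCESS[cpu]

print(bus_rate_mhz("68000", 7.67))   # about 1.92 MHz for the Genesis 68000
print(bus_rate_mhz("65x", 3.58))     # a 3.58 MHz 65816 hits the bus at 3.58 MHz
print(max_clock_mhz("68000", 2.0))   # 2 MHz memory feeds an 8 MHz 68000
print(max_clock_mhz("65x", 2.0))     # the same memory feeds only a 2 MHz 65x
```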
Quote:
The PC-Engine would like to say hello.
Haha, yeah... can you remind me of the ridiculous amount of main memory we have in the PCE? ;) As much as in the Sega Master System, which came out 2 years earlier at half the price. The HuCard capacity was also pretty limited for that same simple reason: the fast HuC6280 required very expensive fast ROM. Fortunately NEC was producing the memory itself, so they could afford such fast memory at a reasonable cost, but still it was (imo) something that really hurt the general design of the system! The PC-Engine was almost the same price as the Sega Megadrive even though the latter was much more advanced, so for me the 6502 is definitely a bad CPU in almost every aspect... Very painful to code for, inefficient in terms of memory usage, not usable with a high-level compiler. Its only advantage: price. But of course if it's cheap you have to pay for it somehow ;)