Quote:
Originally Posted by
tomaitheous
@stef Ooouuu... :3 I accept the challenge! :D
Guess I'm stuck arguing for the snes (why do I always get stuck on this side? I don't even care that much for the SNES :/ Nobody else ever jumps in for the snes arguments. It's like the SNES dev coders don't care or don't have multiple system coding experience; it's all snes and nothing else matters) and you the MD.
Hehe, I would love to actually talk with experienced SNES coders but I guess I won't find many of them on a Sega forum :p
But I know how much you like the PCE and I guess you tend to defend the SNES (and especially its CPU) because it belongs to the same 65x family as the PCE one.
Quote:
I don't doubt at all that the higher bandwidth of the 68k in the MD vs the '816 speed in the snes, has an advantage. But so does the tile format on the Genesis. It's not just cpu bandwidth only. Most compression formats for graphic data (not tilemap, but actual pixel data) are LZ based from that era. It's a known fact that planar never compresses as well as packed pixel format under any LZ variant. It's known that some developers stored graphic data in packed pixel format before compressing with LZxx, and converted the graphic pixel data back to planar afterwards. This definitely adds additional time. And part speculation, but the SNES has more work ram and a slower cpu - so it's possible more data is being uncompressed to compensate for this as well (i.e. more map data and such). But I wouldn't use this directly as an indication of what it's capable for game performance though. It's akin to saying system A is slower or less capable than system B because it has a slower CD drive and thus takes longer to load a level, etc.
Of course having planar pixel storage does not help, but that is actually an advantage of the cleaner (imo) architecture of the Genesis VDP... I just cannot understand why the SNES still used planar mode when almost all systems had started to use linear storage. Anyway, a lot of other older systems still used planar storage (such as the PCE, if I remember correctly) and none of them suffer from "loading time"; honestly I believe the SNES is the only system where we saw that.
Quote:
A lot of PCE games are 'static' looking as well (less flashy), but we know from the higher tier games that this wasn't specifically a hardware or cpu issue. Not everything relative to game design between the snes and md is cpu resource related. I also follow the logic that either 1% or 30%, if you miss the vblank frame marker (overshoot it) for the next frame - it's gonna slow down for that instance. You'd pretty much have to run probing/monitoring for statistical data on both games and check for idle looping and such, and try to see what load the cpu is under on a per frame basis. Though I suspect this has more to do with developers giving themselves head room for the game engine/design relative to processor resource *and* how much cpu resource is available *without* both low level and high level optimizations. Having coded on the 68k; I truly believe that the difference between unoptimized code and optimized code is fairly small in difference compared to the same on the '816 (and 65x in general). I mean, just looking at the ISA of both processors; it's pretty evident.
PCE games are less static than SNES games, and generally have faster gameplay too. Yet not at the level of Genesis games :p
But I totally disagree about the difference between unoptimized and optimized code on the 68k. The 68k has high level instructions, which make programming easier, but if you want fast code you avoid these high level instructions and tend to use registers only.
Take a piece of code in C: the difference between optimization level 0 and level 1 already gives you a x3 / x4 factor in speed.
And in ASM you can gain at least a x2 factor over the optimized C code.
Quote:
Sorry. 70% is near complete optimized code on the '816 architecture. I'm comparing ballz to the wall optimized crazy ass 65x related style code to the 68k. And outside of a very-very few developers, you're just not going to see such written code on the snes. So, I was thinking in terms of what's capable rather than what developers would actually do. There's a difference.
That does count...
Quote:
65x or '816 code bloats fairly easily under optimization and becomes rather complex and convoluted. Neither of those attributes were desirable by developers back in the day. Of the hacks I've done and just tracing through code in general, I've seen a lot of poorly unoptimized 65x based code. So I guess my statement is a little misleading ;) Or rather, how I would code it VS what would be the norm.
I can admit it is probably a more difficult task to optimize code for the 65xxx, but for me that is an argument against that CPU.
Still, I do not see much room for optimization on this CPU (I will give further details later in the post).
Quote:
I disagree on this point. While the 68k has powerful ISA, the instruction execution times are slow. This balances out, but doesn't make for great optimization techniques IMO. At least, nothing compared to some insane 65x/816 related optimization. I've said this quite a few times before, coding on the 68k is like a dream. I mean, the ISA is so powerful and easy, that it almost feels like a higher level language compared to the 65x. The cost of time to performance ratio is much better on the 68k and if I was a system designer back then, I'd choose the 68k as well - specifically for developers.
At least we agree on this last point :p
Quote:
But that doesn't negate what a processor is capable, regardless of the... byproducts (complex code, bloat, etc). Other than 65x design being that the ISA is minimal in functionality but very fast execution wise (which allows you to optimize specifically for a task by reducing redundancy or unneeded steps, compared to complex instructions - because sometimes you just don't need the additional advantages of complex instructions), but that the saving grace of the 65x.... is LUTs. Look up tables, look up tables, look up tables. 65x architecture has fast and free (cycle wise) indexing. Especially for layering/embedded tables. It's pretty much the magic elixir for the 65x. It's much faster to do on the 65x arch than the 68k. So you have the benefit of simple but fast instructions and optimizations related to that, and fast access to precalculated that allow you to optimize for a specific logical situation/problem that the more limited ISA doesn't directly support (or it might, but LUTs are faster). Of course the side effect is bloat code/data(LUTs).
Matter of fact, I pretty much stopped using indexing on the 68k because it was slower. Overall, it's faster just to manually add the base address to the index address in the address register at prep time. And especially for sequential access since the 68k supports self incrementing/decrementing. Indexing is almost useless on the original 68k IMO.
Well, this is where I don't follow you... Maybe the 68k is not fast for LUTs, but it is not that slow depending on what you intend to do. Let's take a simple example I used in my video decoder:
move.w 32(%a0,%d0.w),(%a1)+
This instruction is indeed slow: 18 cycles.
But it reads a word from the lookup table and stores it in memory (with the destination incremented).
How long would the 65816 take to do the same?
I don't know the 65816 well, but I think it gives something like this:
LDA DP,X (4+1)
STA tab,Y (5+1)
INY (2)
(Sorry for the syntax, by DP i mean DirectPage reference)
5 + 6 + 2 --> 13 cycles
Not that much less, and that comes with limitations on the lookup table location and size...
And we arrive at 26 cycles versus 26 cycles when it comes to dword transfers.
For a byte we have 11 cycles for the 65816 versus 18 cycles for the 68000, but honestly we always optimize the code to benefit from the free word transfer, as the 68000 is a 16-bit CPU.
So the 68k is not that slow after all, but you may have better examples than this one.
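The tally above can be written out as a small script. This is only a sketch: the per-instruction figures are the ones quoted in this post (nothing is measured), and the table and function names are made up for the illustration.

```python
# Cycle figures exactly as quoted in the post; nothing here is measured.
M68K_MOVE_W = 18  # move.w 32(%a0,%d0.w),(%a1)+
M68K_MOVE_L = 26  # the same transfer done as a dword (.l)

# 65816 sequence with an 8-bit accumulator (hypothetical table names).
W65816_BASE = {"LDA dp,X": 4, "STA abs,Y": 5, "INY": 2}
# Extra cycle per data access when the accumulator is 16 bits wide.
W65816_WIDE_EXTRA = {"LDA dp,X": 1, "STA abs,Y": 1, "INY": 0}


def w65816_cycles(wide):
    """Sum the sequence, following the post's count of a single INY.

    Assumption worth flagging: stepping Y by a whole word would take a
    second INY (+2 cycles), which the post's tally does not include.
    """
    total = 0
    for op, base in W65816_BASE.items():
        total += base + (W65816_WIDE_EXTRA[op] if wide else 0)
    return total


byte_cost = w65816_cycles(wide=False)
word_cost = w65816_cycles(wide=True)
dword_cost = 2 * word_cost  # two word transfers, as counted in the post

if __name__ == "__main__":
    print("byte :", byte_cost, "vs", M68K_MOVE_W)
    print("word :", word_cost, "vs", M68K_MOVE_W)
    print("dword:", dword_cost, "vs", M68K_MOVE_L)
```

The comparison is in raw cycles only; the two CPUs do not run at the same clock.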
Something really annoying with the 65xxx series is that you often need several instructions for a basic operation.
Even a simple addition almost always needs 2 instructions (CLC, ADC); it is unbelievable that a plain ADD does not exist.
When you use an instruction operating on a long on the 68k, you generally need 2 or 3 instructions for the equivalent on the 65816.
And I am not even talking about the advantage of having MUL and DIV instructions (where you have to use special I/O registers on the SNES).
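To make the CLC + ADC point concrete, here is a rough model of a 16-bit add-with-carry and of a 32-bit addition built from two of them. It is a sketch in Python, not 65816 code: the function names are hypothetical, decimal mode is ignored, and inputs are assumed to be unsigned values that fit in 32 bits.

```python
MASK16 = 0xFFFF


def adc16(a, operand, carry):
    """Model a 16-bit binary ADC: returns (result, carry_out)."""
    total = a + operand + carry
    return total & MASK16, total >> 16


def add32_via_adc(x, y):
    """32-bit add as two 16-bit ADC steps, low word first."""
    carry = 0                                         # CLC
    lo, carry = adc16(x & MASK16, y & MASK16, carry)  # ADC on the low words
    hi, _ = adc16(x >> 16, y >> 16, carry)            # ADC on the high words
    return (hi << 16) | lo


if __name__ == "__main__":
    # A stale carry from earlier code changes the result, hence the CLC.
    print(adc16(1, 2, 0), adc16(1, 2, 1))
    # 16.16 fixed point: 1.5 + 0.75
    print(hex(add32_via_adc(0x00018000, 0x0000C000)))
```

The carry is cleared once and then deliberately kept between the two halves, which is why the single ADC instruction exists without a carry-less ADD next to it.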
Quote:
It's a possibility that i might, somewhat.. about the 68k. I mean, I've coded on the 65x for like 6+ years and only a year on the 68k. 65x optimization is neither intuitive nor apparent. After 1 year of coding on the 65x, I thought I knew everything there was to know or how to do. Boy, was I wrong. Once I found a couple of forums from some die hard 65x older timer software programmers, did I realize that I didn't actually know shit. Optimization was a whole new level (I don't wanna brag or anything but... I'm pretty damn good now :D Maybe not the best, but I'm up there... somewhere ;) ). I took this into consideration when after getting used to the 68k and honestly.. I didn't find anything near that level. I did find a couple of clever optimizations (props to Chilly Willy, he's an old timer 68k pro and I learned a few tricks from his posts across a couple of forums), but almost all of the optimization on the 68k *is* intuitive IMO.
I agree with this: 68k instruction level optimizations are easier to find, but what is interesting is algorithm level optimization, where you always try to use registers as much as you can. The 68k offers that possibility, which you don't have on the 65x. On a 65x each operation has to be done in memory as you just cannot operate on registers; all optimizations seem to come down to "trying to get all working data in the direct page area"...
Honestly I don't consider myself a 68k guru, far from it actually! I have coded on it for several years, but only in short periods; I do have experience with many other CPUs' assembly languages though, and I hope it helps. Regarding the Bad Apple demo, I could make some parts even faster but I didn't need to... and a real 68k guru could do better!
Quote:
I think the only serious optimization techniques that I saw, were code lists or precalculated code (that sounds funny. I forget the official term. It's code written a dozen or more times over, with slight changes. But can be crazy like in the hundreds - probably machine generated. And you JSR to the specific 'list' from a table that's determined by some sort of data sets). Though that's not exclusive to the 68k, but I seen it mentioned much more for it than on any 65x forums.
Are you talking about generated code? You can also use a self-modified jump table to do smart variable length code.
What is nice with the 68k is that unrolling can help a lot in a simple memory copy routine, and here a 65x won't stand a chance...
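A rough cycle model shows why unrolling pays off in a plain copy loop. The figures are assumptions taken from the standard 68000 timing tables as I recall them (20 cycles for a post-increment long move, 10 for a taken DBRA, 14 when the counter expires), with zero wait states and no setup cost; the function name is hypothetical.

```python
# Assumed 68000 timings, zero wait states.
MOVE_L_POSTINC = 20  # move.l (%a0)+,(%a1)+
DBRA_TAKEN = 10      # dbra, branch taken
DBRA_EXPIRED = 14    # dbra, counter expired (loop exit)


def copy_cycles(longwords, unroll):
    """Cycles to copy `longwords` longs with `unroll` moves per loop pass."""
    if unroll < 1 or longwords < unroll or longwords % unroll:
        raise ValueError("longwords must be a positive multiple of unroll")
    passes = longwords // unroll
    loop_overhead = (passes - 1) * DBRA_TAKEN + DBRA_EXPIRED
    return longwords * MOVE_L_POSTINC + loop_overhead


if __name__ == "__main__":
    for unroll in (1, 2, 4, 8):
        print(unroll, copy_cycles(256, unroll))
```

In this model a 1 KB copy costs 7684 cycles with one move per pass and 5444 with eight; the move itself is untouched, only the share of loop overhead per long shrinks.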
Quote:
I don't think I'm over estimating the '816 though. I didn't really know what to think of it until I started writing code specifically for it, even though I had cut my teeth on the 65x many years before. I think maybe I just see what the '816 is capable of/potential, in what I would do with it, rather than what's actually consider practical and normal written code.
Not just 8bit operations (putting the cpu regs in 8bit mode to reduce cycle times for specific code, though kind of rare as you have overhead of set and clearing said register 'width' flag), but 16bit and 32bit operations too. Despite the 8bit data bus (man, this chip would scream with a proper 16bit data bus), cycle wise I was able to get faster small chunks of code out of the '816 than the 68k. Like for instance 16.16 fixed point addition (32bit adds) for objects, it was faster '816. I wrote it multiple ways too (static to indirect, register to indirect, indirect to indirect, etc) because sometimes what looks like faster code (and is) can also be a useless other than for example (and out of context). I also did the same for object to map collision detection. Though the code/routine was a little more modular in that you JSR'd to it with preset data loaded for it to use and this was a little slower on the 68k because JSR and Return instructions are pitifully slow IMO, but even without that edge the '816 version was quite a bit faster... cycle count wise (and on the 68k, I reserved three regs that had fixed values for the routine that would be called multiple times. That kind of optimization has to be weighed more carefully in real world examples IMO. You can only do that so much/so far with multiple routines that run from a parent routine). Anyway, the base cycle count (without JSR/RTS) was something like 63 cycles on '816 and 136 on 68k. Though my calculations didn't include the wait-states on work ram cycles of the '816 instructions, so it'd be a little slower than that. There's no doubt in my mind that a '816 running at the same cycle speed as a 68k would be faster. The SNES didn't need a 68k, it just needed a redesigned '816 with a proper bus design instead of the stupid multiplexed upper address bits being on the data bus (wtf? it's custom chip already; not a stock '816. Why keep the shitty bus design???). 
That alone would have cut the rom speed requirement in HALF and allow the '816 to run double the clock speed with the same rom speeds. Hell, even the original 65x chips had really tight timing requirements for devices on the bus without the address multiplexing BS of the '816 forcing twice the speed, was fixed in the huc 6280 design for the better (120ns roms are good to go on the 7.16mhz cpu. That's too slow for a 6502 or 65c02 at the same speed). But I digress...(seriously).
Fortunately, cycle-wise you can have the advantage in a lot of cases! We are talking about a 2.6 / 3.6 MHz 65816 compared to a 7.7 MHz 68000.
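To put the cycle counts from the quote on a common time scale, here is a quick conversion sketch. The 63 and 136 cycle figures are the ones tomaitheous quoted, the clock rates are the rounded ones given here, and the dictionary and function names are only for illustration.

```python
# Rounded clock rates from this post; cycle counts from the quoted routine.
CLOCK_MHZ = {"65816 @ 2.6": 2.6, "65816 @ 3.6": 3.6, "68000 @ 7.7": 7.7}
ROUTINE_CYCLES = {"65816": 63, "68000": 136}


def microseconds(cycles, clock_mhz):
    """Cycles divided by MHz gives microseconds."""
    return cycles / clock_mhz


def routine_times():
    """Time of the quoted routine for each CPU/clock combination."""
    times = {}
    for name, mhz in CLOCK_MHZ.items():
        cpu = name.split(" @ ")[0]
        times[name] = microseconds(ROUTINE_CYCLES[cpu], mhz)
    return times


if __name__ == "__main__":
    for name, t in routine_times().items():
        print(f"{name} MHz: {t:.1f} us")
```

With these rounded figures the two versions land within a fraction of a microsecond of each other at 3.6 MHz, and the '816 version is the slower one at 2.6 MHz. The wait states that the quoted count leaves out would push the '816 times up further.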
Still, I don't see how you can do a faster 16.16 fixed point addition with a 65816 compared to a 68000, where a 32-bit addition is basically 6 or 8 cycles! I guess I am missing something.
If you want, you can paste a specific portion of 65x code you are very proud of, and I could try to do a 68000 version which arrives at the same result (but which may use a different, more 68k friendly approach). Then we can do the reverse... I think you are too used to the 65x coding logic and probably do not see the real potential of the 68k.
The reverse is true as well, at least... I have many preconceptions about a CPU which cannot operate on registers :p
I already wrote some code for the 6502 (VIC 20), but that was a long, long time ago, and even then I was quite disappointed by the limited ISA (at that time I was more used to x86 CPUs, which are not really nice overall).
At least there are tons of examples of what a 68000 can achieve, we just have to look at the Amiga / Atari 16-bit computer demoscene... unfortunately we don't have the same for the 65816.
Quote:
I don't see it clearly as that. If you mentioned something like Sonic, sure (and *not* specifically because sonic runs fast :p ). But when I look at AB&R, all I see it 'tech demo' with some gameplay thrown on top. Look at the levels, other than enemies - there isn't much going on at all. Level maps consist of single repeating screens, scrolling over and over (nothing at all like Sonic map system or even other games on both systems from that generation). No decompression, etc. Hell, some of it looks like you write the map data once and never touch it again till the transition (next level or boss).
Indeed, the background is not what we should watch in this game, but some of them at least use a nice perspective effect, and that is pretty easy given the VDP capabilities...
Quote:
The boss screens are usually no different (usually static too). And a lot of levels have minimal to none (some levels have none at all) for map collision. In my experience, object to map collision is much more expensive processor resource wise than object to object collision detection. If anything, I can see the SNES doing the game as is... in the right hands of course. It's a typical 'tech demo' game; looks flashy and incredible... like a professional magician. Also hides it's limitations fairly well with distractions... also like a professional magician.
Honestly no, I don't think we can do that:
https://www.youtube.com/watch?v=kmVPo_gg6b4#t=175s
Look at the number of animated sprites, the collisions to handle, the different explosion particle animations... typically what is missing in SNES games.
But hey, I guess we won't change our minds on that point :p
Quote:
This is completely from memory (which fails me more and more as I get older :p) : The floor is a single pic/image; 15 colors or less. There are two subpalettes. One subpalette assigned to the 'pic/image' or background layer has the red tones in specific checkered areas and white tones in opposite ones. The second subpalette is the reverse. So you index the tiles twice, once (and complete) in on part of the tilemap using subpalette A and again in another part of the tilemap (complete) using subpalette B. Then you use an hsync routine (or in the MD case, fine scroll table) and set on a specific scanline which part of the tilemap to show. This allows you to 'animate' the checkered board while the 'face' remains static. Then of course you have warping/stretching logic applied on top of it. It's pretty clever, but pretty simple; some of the best looking effects usually are.
Oh indeed, that would do it! I like how some neat tricks can look that good :)
That does need a bit of calculation to update the horizontal scroll table, and it also requires H-Int to change the vertical scroll on a per-line basis. Very well done :)
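The trick described in the quote can be sketched as a per-scanline table that picks between two palette-swapped copies of the floor. Everything specific here is an assumption for illustration: the copy offsets, the number of lines, the band height and the function name are invented, and the warping/stretching logic is left out entirely.

```python
FLOOR_LINES = 8   # hypothetical number of scanlines covered by the floor
COPY_A_X = 0      # floor tiles mapped with subpalette A
COPY_B_X = 256    # same tiles mapped again with subpalette B (colours swapped)


def hscroll_table(phase, band_height=2):
    """One scroll value per scanline, selecting copy A or copy B.

    Bumping `phase` each frame slides the bands, which animates the
    checkerboard while the image itself is never rewritten.
    """
    table = []
    for line in range(FLOOR_LINES):
        band = (line + phase) // band_height
        table.append(COPY_A_X if band % 2 == 0 else COPY_B_X)
    return table


if __name__ == "__main__":
    for phase in range(3):
        print(phase, hscroll_table(phase))
```

On real hardware the table would feed the line scroll (or an H-Int routine) rather than a Python list; the point is only that the per-frame work is a short table update.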
By the way, are you still working on your BA PCE version? I would really like to see what can be done on the PCE for that :)
But I guess you will be somewhat busy with your motorcycle license now :)