Friday, May 9, 2008

Sound and Vision: A Technical Overview of the Emotion Engine

By Jon Stokes | Published: February 16, 2000 - 07:00AM CT

Preface

I'll spare you the obligatory opening fluff paragraph that goes something like "when Sony first announced the Emotion Engine...," and I'll cut right to the chase. The Emotion Engine is weird. It's so weird, in fact, that it took me quite a while to figure out how it works, and it's going to take me quite a while to explain it. So I want to approach this topic in two parts.

The first part of this article will not be as technically granular as most of my previous work because I want to provide a pertinent overview and a context for understanding the Emotion Engine in detail, without addressing some of the more complex architectural issues. For the technically uninitiated, the first part will suffice in bringing you the mojo you need (and hopefully whetting your whistle for more). With the foundation laid in part I, the second part of this article will then delve into the depths of the Emotion Engine. This second part is probably less accessible than the first, because to understand it, you'll need to be familiar with CPU architectural concepts like pipelining, VLIW, SIMD, instruction latency, throughput, etc. If you are not familiar with these terms, I'd suggest checking out some of my previous work. [Update 10/05/00: I've since written a system-level comparison of the PS2 and the PC, which should be a bit more accessible than this article. I'd suggest reading it first.]

Also, a disclaimer before I get started. The literature on the PS2 offers conflicting numbers for the sizes of the various caches in the Emotion Engine. These numbers are usually the last to be fixed at production time, and I'm not sure of the latest ones, so I used the numbers that I thought were most recent. Feel free to correct me if you know otherwise.

Part I: General Playstation 2 Overview

The bulk of this article will deal exclusively with the design and function of the heart of Sony's Playstation 2: the Emotion Engine. However, because of the way Sony designed the PS2, it's not really possible to look only at the Emotion Engine and ignore the rest of the system. So we'll start out by looking at the system as a whole, and then we'll narrow the discussion to the Emotion Engine.

We have to look at the Emotion Engine in the context of the overall design of the PS2 because, unlike a modern PC's CPU, the Emotion Engine is not really a general purpose computing device. The CPU of a PC is designed from the ground up to run SPEC benchmarks...er, application code as fast as possible. Since application code comes in a wide variety of forms and performs a wide variety of functions, CPU hardware must be general enough to give acceptable performance on almost anything a coder throws at it. In a PC system, this general-purpose CPU is augmented by special purpose hardware for things like video acceleration, network communications, sound processing, etc.

The PS2 is in a slightly different situation. The PS2's designers had the luxury of designing hardware whose main purpose is to run one type of application extremely well: the 3D game. Sure, the PS2 can run web browsers, mail clients, and other types of software, but that's all secondary. The main thing the PS2 does is 3D gaming, which means generating the kind of immersive sound and vision that places you in a virtual world. Nearly all of the PS2's hardware is dedicated to providing some specific portion of that audiovisual gaming experience.

So just how does the PS2 generate 3D graphics and sound? Let's take a look at the main parts of the PS2.

In the above picture, you can see that there are four main parts to the device. Let's look at them one at a time. The I/O Processor (IOP) handles all USB, FireWire, and game controller traffic. When you're playing a game on the PS2, the IOP takes your controller input and sends it to the Emotion Engine so that the Emotion Engine can update the state of the game world appropriately. The Emotion Engine is the heart of the PS2, and the part that really makes it unique. The Emotion Engine handles two primary types of calculations and one secondary type:

  • Geometry calculations: transforms, translations, etc.
  • Behavior/World simulation: enemy AI, calculating the friction between two objects, calculating the height of a wave on a pond, etc.
  • Misc. functions: program control, housekeeping, etc.

When all is said and done, the Emotion Engine's job is to produce display lists (sequences of rendering commands) to send to the Graphics Synthesizer. The Graphics Synth is sort of a souped-up video accelerator. It does all the standard video acceleration functions, and its job is to render the display lists that the EE sends it. Finally, the Sound Processor is the "soundcard" of the PS2. It lets you do 3D digital sound using AC-3 and DTS.

The Emotion Engine is sort of a combination CPU and DSP, whose main function is simulating 3D worlds. So before we discuss the Emotion Engine's architecture in detail, we should talk a bit about DSP (digital signal processing) and 3D graphics.


Background: 3D rendering and DSP basics

A DSP processor, in a nutshell, takes in massive amounts of input data and performs repetitive, loop-based calculations on it to produce massive amounts of output data. Speed and bandwidth are of the essence in digital signal processing. One of the most useful and important features that modern DSPs have is the ability to do a MAC (multiply-accumulate) in a single cycle. A MAC is used in a variety of vector calculations, the most common of these being the dot product. The dot product involves summing the products of vector element pairs, and it requires a series of MACs to calculate. The Emotion Engine has a total of 10 FMACs (Floating-Point Multiply-Accumulators), each of which can do one 32-bit floating-point MAC operation per cycle. If you were wondering what's behind those outrageous polygon counts that Sony publishes for the PS2, now you know; the PS2 can do a lot of MACs, very, very quickly.
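
To make the MAC-and-dot-product connection concrete, here's a minimal C sketch (mine, not Sony's code). Each loop iteration is exactly one multiply-accumulate, so a core that retires one MAC per cycle per FMAC can chew through vectors at one element per FMAC per cycle.

#include <stdio.h>

/* Dot product: the canonical MAC workload. Each iteration is one
 * multiply-accumulate, the operation the EE's FMACs do in one cycle. */
static float dot_product(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];     /* multiply, then accumulate */
    return acc;
}

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    printf("%f\n", dot_product(a, b, 4));   /* prints 70.000000 */
    return 0;
}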

The second big requirement for a DSP processor is memory bandwidth and availability. The PS2 is full of small, strategically placed caches (the SPRAM, VU instruction and data memory, etc.) that can be accessed in a single cycle. More importantly, the SPRAM (Scratch Pad RAM--more on this in a moment) interleaves those single-cycle CPU accesses with slower memory bus DMA accesses, so the SPRAM doesn't get tied up by the slower main bus. Finally, the Emotion Engine contains a 10-channel DMA controller (DMAC) to manage up to 10 simultaneous transfers on the Emotion Engine's internal 128-bit, 64-bit, and 16-bit buses. With the DMA controller directing all that bus traffic between the various components and types of memory, the other components are free to do their thang without having to manage data transfers themselves.

The final bit of background info that's pertinent to our project involves 3D rendering. This isn't the place to discuss 3D rendering basics or anything like that (and I'm not really the guy to discuss them either), but there is one aspect of the rendering process we should cover. The Graphics Synthesizer on the PS2 takes data from the Emotion Engine in a very specific form: the display list. The display list is a sequence of drawing commands that tells the GS which primitive shapes to draw and where. A typical display list contains commands to draw vertices, shade the faces of polygons, render bitmaps, etc.--basically, the commands required to actually draw the virtual, 3D world of the game. The Graphics Interface unit (GIF) can take multiple display lists from multiple units inside the Emotion Engine and combine them to allow the Graphics Synth to produce composite images. Or, it can arbitrate between the lists to decide which ones get drawn and when.
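
If the notion of a display list is still fuzzy, the following C sketch may help. The opcodes and layout are invented for illustration (the GS's real command format is considerably more involved); the point is just that a display list is a flat buffer of drawing commands consumed in order.

/* Hypothetical drawing commands -- not the GS's real encoding. */
enum draw_op { OP_VERTEX, OP_TRIANGLE, OP_END };

struct draw_cmd {
    enum draw_op op;
    float x, y, z;        /* position, used by OP_VERTEX */
    unsigned int rgba;    /* vertex color, used by OP_VERTEX */
};

/* A display list: the renderer just walks this buffer front to back. */
static const struct draw_cmd display_list[] = {
    { OP_VERTEX,   0.0f, 0.0f, 1.0f, 0xff0000ffu },
    { OP_VERTEX,   1.0f, 0.0f, 1.0f, 0x00ff00ffu },
    { OP_VERTEX,   0.0f, 1.0f, 1.0f, 0x0000ffffu },
    { OP_TRIANGLE, 0.0f, 0.0f, 0.0f, 0 },   /* draw the three vertices above */
    { OP_END,      0.0f, 0.0f, 0.0f, 0 },
};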

Since the Graphics Synthesizer eats display lists, the Emotion Engine's main job is to feed those lists to it. The Emotion Engine's various subunits can operate independently of each other in order to asynchronously generate multiple display lists to send to the GS. Since the Graphics Synth's interface unit, the GIF, handles, tracks, and manages all of these display lists, the Emotion Engine doesn't really have to waste computational resources or internal bus bandwidth keeping track of them. Its various sub-units just concentrate on cranking them out and sending them over a dedicated, 64-bit bus to the GIF. Render 'em all, and let the GIF sort 'em out.



The Emotion Engine: Basic Architecture

As was stated above, the Emotion Engine's primary piece of output is the display list. Generating those display lists involves a number of steps besides just the obvious geometry calculations. For instance, if the software you're running is a racing game, then you've got to first calculate the virtual friction between a car's tires and the road (among other things) when the car turns a corner before you can draw the next scene. Or if the game is an FPS, you have to run the enemy AI's path-finding code so you'll know where to place each enemy on each frame. So there's a lot of stuff that goes on behind the scenes and affects the output on the screen. All of this labor--geometry calculations, physics calculations, AI, data transfers, etc.--is divided up among the following units:

  • MIPS III CPU core
  • Vector Unit (which is actually two vector units, VU0 and VU1)
  • Floating-point coprocessor, or FPU
  • Image Processing Unit (the IPU is basically an MPEG2 decoder with some other capabilities)
  • 10-channel DMA controller
  • Graphics Interface unit (GIF)
  • RDRAM interface and I/O interface (for connecting to the two RDRAM banks and the I/O Processor, respectively)

All of the above components are integrated onto one die and are connected (with the exception of the FPU) via a shared 128-bit internal bus.

As was noted in the bullet list, the VU can be further divided into two independent, 128-bit SIMD/VLIW vector processing units, VU0 and VU1. These units, though they're microarchitecturally identical, are each intended to fill a specific role. Toshiba, who designed the Emotion Engine and licensed it to Sony, didn't feel that it was optimal to have three pieces of general purpose hardware (a CPU and two vector processors) that could be assigned to any task that was needed. Instead, they fixed the roles of the devices in advance, customized the devices to fit those roles, and organized them into logical units. In that respect, they're sort of like employees who've been grouped together on the basis of talent and assigned to teams. Let's look at the division of labor amongst the components:

  1. CPU + FPU: basic program control, housekeeping, etc.
  2. CPU + FPU + VU0: behavior and emotion synthesis, physics calculations, etc.
  3. VU1: simple geometry calculations that produce display lists which are sent directly to the Graphics Synth (via the GIF).
  4. IPU: image decompression.

Of the above "teams," 2 and 3 are the ones I want to talk about here.

The CPU/FPU/VU0 team

The FPU and VU0 are coprocessors for the MIPS III CPU core. This means that the CPU, the FPU, and VU0 all form a logical and functional unit (or a team, if you will) where the CPU is the primary, controlling device and the other two components extend its functionality. This CPU/FPU/VU0 team has a common set of goals: emotion synthesis, physics, behavior simulation, etc. I'll be going into much more detail on this collaboration in the second half of the article.

There are two main things that bind this team together and allow them to work very closely with each other. The first is the way they communicate with each other: VU0 and the FPU each have a dedicated, 128-bit coprocessor bus that connects them directly to the CPU. That way, they don't have to talk over the main, shared bus. The dedicated 128b bus also gives the CPU direct access to VU0's registers, and allows VU0 to fill its role as a standard, MIPS III coprocessor.

The other important component that ties the CPU core and VU0 closely together is the Scratch Pad RAM. The SPRAM is 16K of very fast RAM that lives on the CPU, but that both the CPU and VU0 can use to store data structures. The SPRAM also acts as a staging area for data, before it's sent out over the 128b internal bus. So the SPRAM is kind of like a shared workspace, where the CPU and VU0 collaborate on a piece of data before sending it out to wherever it needs to go next.
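
To picture how that interleaving plays out in code, here's a hedged sketch of double-buffered staging. The function names (dma_send, build_vertices) are invented stand-ins, not Sony library calls; the idea is just that the CPU/VU0 fill one half of the scratchpad at core speed while the DMAC drains the other half over the main bus.

#include <stdint.h>

#define SPRAM_HALF 8192   /* 16K scratchpad treated as two 8K halves */

extern void dma_send(const void *src, uint32_t bytes);    /* invented DMAC kick */
extern void build_vertices(uint8_t *dst, uint32_t bytes); /* invented app work */

void stage_frame_data(uint8_t *spram)   /* spram = scratchpad base address */
{
    int front = 0;
    for (int chunk = 0; chunk < 8; chunk++) {
        uint8_t *buf = spram + front * SPRAM_HALF;
        build_vertices(buf, SPRAM_HALF);  /* single-cycle CPU/VU0 accesses    */
        dma_send(buf, SPRAM_HALF);        /* DMAC drains it over the 128b bus */
        front ^= 1;                       /* flip halves for the next chunk;
                                             real code would also wait for the
                                             DMA to finish before reuse */
    }
}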

The VU1/Graphics Synth team

The other main team is composed of VU1 and the Graphics Synthesizer (which communicate via the GIF). Just as VU0 has a dedicated bus to the CPU core, VU1 has its own 128-bit dedicated path to the GIF. However, VU1 and the Graphics Synth aren't as closely tied together as the CPU/FPU/VU0 group. VU1 and the GS are more like equal partners, and one doesn't control the other. You're probably wondering at this point just who does control VU1. This is an interesting question, and we'll discuss the answer when we talk about VU1's architecture in detail.

Putting the teams together

Though the roles of the components were fixed by the PS2's designers, the overall design is still quite flexible. You can divvy up an application's work amongst the teams however you like. For instance, the CPU/FPU/VU0 group can generate display lists and do geometry processing in parallel with VU1, so both groups can send display lists to the GIF at the same time.

Or, the CPU/FPU/VU0 group can act as a sort of preprocessor for VU1. The CPU and co. process conditional branches in the rendering code and load data from main memory. They then generate world information that VU1 takes as input and turns into a display list.

This flexibility allows developers to customize the process of generating and rendering the 3D environment to suit the needs of the specific application they're working on.

Now that we've gone over the basics of the Emotion Engine's operation, it's time to get hardcore. For the remainder of this article, I'll go in-depth on the MIPS III CPU core, VU0, and VU1. I'll give you the straight scoop on how these components are designed, and how they're integrated with each other. If terms like instruction latency, pipelining, and SIMD make your eyes glaze over, then you might want to check out here. If, however, you're an architecture enthusiast who eats CPU internals for breakfast, then hang on, because what follows is quite fascinating.



The MIPS III CPU Core

The MIPS ISA has been a popular one for everything from game consoles to SGI workstations. Check out this page for the rundown on the various products that MIPS has shown up in. Among them are:

  • Sony Playstation
  • Nintendo 64
  • Sony's WebTV
  • Casio's Cassiopeia PDA line
  • Sony's AIBO
  • Various printers, copiers, scanners, etc.

In short, the MIPS ISA is an industry standard RISC ISA that's found in applications almost everywhere. Sony's MIPS III implementation is a 2-issue design that supports multimedia instruction set enhancements. It has 32, 128-bit GPRs (general purpose registers), and the following logical pipes:

  • Two 64-bit integer ALUs
  • a 128-bit Load/Store Unit
  • a Branch Execution Unit
  • FPU Coprocessor (COP1)
  • Vector Coprocessor, VU0 (COP2)

(Here's a shot of the processor block diagram.) The core can issue two 64-bit integer ops, or one integer op and one 128-bit Load/Store per cycle. For the obsessed, below is a handy chart that gives you a breakdown of all the types of instructions that can be issued concurrently:

             ALU  MAC0  MMI  Branch  COP1 oper.  COP2 oper.
ALU           X    X     X     X         X           X
MAC1          X    X     X     X         X           X
LZC           X    X     X     X         X           X
Ld/St         X    X     X     X         X           X
SYNC          X    X     X     X         X           X
Branch        X    X     X     -         X           X
ERET          X    X     X     -         X           X
COP0 ld/mov   X    X     X     X         X           X
COP1 ld/mov   X    X     X     X         X           X
COP2 ld/mov   X    X     X     X         X           X

The two, fully-pipelined 64b integer ALUs are interesting, because they can either be used independently of each other (as in a normal CPU), or they can be locked together to do 128-bit integer SIMD in the following configurations: sixteen, 8-bit ops/cycle; eight, 16-bit ops/cycle; four, 32-bit ops/cycle. Pretty sweet.
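
Here's a plain-C model of the narrowest of those configurations (sixteen 8-bit ops per cycle). The hardware does all 16 byte lanes at once in one 128-bit operation; the loop below just spells out the lanes.

#include <stdint.h>

/* Sixteen independent 8-bit adds -- one 128-bit SIMD operation's worth.
 * Each lane wraps around on its own; no carries cross lane boundaries. */
void add_16x8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    for (int lane = 0; lane < 16; lane++)
        out[lane] = (uint8_t)(a[lane] + b[lane]);
}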

To take advantage of the integer and FP SIMD capabilities that COP2 (COP2 = VU0) and the iALUs provide, Toshiba used extensions to the MIPS III ISA that include a comprehensive set of 128-bit SIMD instructions. Here are the instruction types that the CPU core supports:

  • MUL/DIV instructions
  • 3-op MUL/MADD instructions
  • Arithmetic ADD/SUB instructions
  • Pack and extend instructions
  • Min/Max instructions
  • Absolute instructions
  • Shift instructions
  • Logical instructions
  • Compare instructions
  • Quadword Load/Store
  • Miscellaneous instructions

The CPU has a 16K instruction cache and an 8K data cache, each of which is two-way set associative. It's also capable of speculative execution, using a simple, two-part branch prediction mechanism (a 64-entry BTAC and a BHT). Toshiba didn't waste a lot of resources on branch prediction, because the CPU's pipeline is a short 6 stages; unlike with Willamette's extremely deep 20-stage pipeline, the penalty for a mispredict isn't too high. Here are the pipe stages:

1. PC select
2. Instruction fetch
3. Instruction decode and register read
4. Execute
5. Cache access
6. Writeback

Pretty standard RISC stuff. As usual, the execute stage can take a couple of cycles depending on the latencies of the instructions.

I discussed the SPRAM earlier, but I didn't mention that it has its own address space that's accessible to the CPU via standard MIPS III Load/Store instructions. These loads and stores are interleaved with the main bus accesses to keep throughput up.
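
In code, that just looks like ordinary pointer dereferences into the SPRAM's address range. The base address below is the one commonly cited in PS2 homebrew documentation; treat it as an assumption rather than gospel.

#include <stdint.h>

/* Assumed scratchpad base address (per homebrew docs); illustrative only. */
#define SPRAM_BASE ((volatile uint32_t *)0x70000000u)

void spram_demo(void)
{
    SPRAM_BASE[0] = 0xdeadbeefu;   /* an ordinary MIPS store, single-cycle */
    uint32_t v = SPRAM_BASE[0];    /* an ordinary MIPS load */
    (void)v;
}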

The FPU coprocessor, COP1, doesn't really deserve its own section. It's pretty much a straight-up, floating-point coprocessor that's a throwback to the classic RISC days when the FPU was a separate unit from the core. It executes basic, 32-bit MIPS coprocessor instructions using one FMAC unit and one FDIV unit, each of which is the same as the FMACs and FDIVs in the vector units. I'll talk about what these units do when we get to the vector unit discussion.

As you can tell from the above, the CPU core isn't really all that exciting. The only really cool things in it are the SPRAM and the 128-bit integer SIMD capabilities. Other than that, there's not much out of the ordinary going on. This unit is mostly here to handle program control flow by processing branch commands. It also does other stuff, but it doesn't do any of the real heavy lifting--that's reserved for the vector units.



Vectors

Both VU0 and VU1 are microarchitecturally identical, but they're not functionally identical. VU1 has some extra features tacked onto the outside of it that help it do geometry processing, and VU0 has some features that it doesn't normally use (but that VU1 does). Toshiba did things this way to make the units easier to manufacture. Since VU0 is simpler, we'll start with it first. Just keep in mind that a lot of what's said about VU0 also applies to VU1.

Vector Unit 0

VU0 is a 128-bit SIMD/VLIW design. (If you're confused about the term "SIMD/VLIW," don't worry, so was I at first. We'll discuss what this term means in a special section to follow.) Since VU0 is a coprocessor for the MIPS III core, it spends most of its time operating in Coprocessor Mode. This means it looks like just another logical pipe (along with the integer ALUs) to the programmer. The instructions that make VU0 go are just 32-bit MIPS COP instructions, mixed in with integer, FPU, and branch instructions. In this respect, VU0 looks a lot like the G4's AltiVec unit. Often, in the rendering process, the CPU maintains a separate thread that controls VU0. The CPU places FP data on the dedicated bus in 128b chunks (w,x,y,z), which the VIF unpacks into 4 x 32-bit words for processing by the FMACs.

VU0 has its own set of 32, 128-bit FPRs (floating-point registers), each of which can hold 4, 32-bit single precision floating-point numbers. It also has 16, 16-bit integer registers for integer computation.

Here are the computational units available to VU0 (and VU1):

  • 4 FMACs
  • 1 FDIV
  • 1 LSU
  • 1 ALU
  • 1 random number generator.

The first 5 units here, the 4 FMACs and the 1 FDIV, are sort of the heart of both VU0 and VU1 (which are themselves the heart of the Emotion Engine, which is itself the heart of the PS2). So this is where the magic happens. Each of the FMACs can do the following instructions:

Floating-Point Multiply-Accumulate   1 cycle
Min/Max                              1 cycle

The FDIV unit does the following instructions:

Floating-point Divide     7 cycles
Square Root               7 cycles
Inverse Square Root      13 cycles

The bulk of the processing that the PS2 does to make a 3D game involves performing the above operations on lots and lots of data.
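
To see why the FMAC/FDIV mix is what it is, consider the inner loop of 3D geometry: a 4x4 matrix-vector multiply (sixteen MACs) followed by a perspective divide (one divide). A scalar C sketch of the work, not of Sony's code:

typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; } mat4;

/* Transform a vertex and apply the perspective divide. The sixteen
 * multiply-adds map onto FMACs (1 cycle each); the reciprocal maps onto
 * the FDIV (~7 cycles), which is why there are far more FMACs than FDIVs. */
vec4 transform(const mat4 *m, vec4 v)
{
    vec4 r;
    r.x = m->m[0][0]*v.x + m->m[0][1]*v.y + m->m[0][2]*v.z + m->m[0][3]*v.w;
    r.y = m->m[1][0]*v.x + m->m[1][1]*v.y + m->m[1][2]*v.z + m->m[1][3]*v.w;
    r.z = m->m[2][0]*v.x + m->m[2][1]*v.y + m->m[2][2]*v.z + m->m[2][3]*v.w;
    r.w = m->m[3][0]*v.x + m->m[3][1]*v.y + m->m[3][2]*v.z + m->m[3][3]*v.w;
    float inv_w = 1.0f / r.w;                 /* the FDIV's job */
    r.x *= inv_w; r.y *= inv_w; r.z *= inv_w;
    return r;
}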

Now, those last three units in my list (LSU, ALU, and RNG) aren't normally shown in most charts as being part of VU0. I suspect this is because they aren't used in coprocessor mode. When VU0 is acting like a MIPS Coprocessor, it only uses the 4 FMACs. "Wait a minute," you're saying, "isn't VU0 always a MIPS coprocessor--you know, the 128-bit dedicated bus and stuff? You went to great lengths to make that point in the first half of the article." Yeah, I did kind of insist that VU0 is on the CPU's "team," and that they share the same goals, and that it's bound to the CPU, etc. This is kind of misleading (although I would argue heuristically justifiable), but all will become clear in the final section. For now, just understand that VU0 mostly operates as a MIPS Coprocessor that handles any FP SIMD instructions that show up in the CPU's instruction stream.

Vector Unit 1

VU1 is a fully independent SIMD/VLIW processor that includes all the architectural features of VU0, plus some additional mojo. These additions relate directly to VU1's role as a geometry processor for the Graphics Synth, and they help bind it more tightly to the GS. The primary addition is an extra functional unit, the Elementary Functional Unit (EFU). The EFU is just 1 FMAC and 1 FDIV, like the CPU's FPU. The EFU performs some of the more basic calculations required for geometry calculation.

Another big difference between VU1 and VU0 is that VU1 has 16K of data memory and 16K of instruction memory (as opposed to VU0's 8K data/8K instruction sizes). This larger amount of data memory is needed because VU1's role as a geometry processor requires that it handle much more data than VU0.

Finally, VU1 has multiple paths it can take to get data to the GIF (and on to the GS). Like VU0, VU1 can send display lists to the GIF via the main, 128b bus. Or, VU1's VIF can send data directly to the GIF. Finally, there's a direct connection between VU1's 16K data memory and the GIF, meaning that VU1 can work on a display list in data memory, and the DMAC can transfer the results directly to the GIF.

I have to pause here and note that there's some serious confusion in Sony's literature on the direct path between VU1 and the GIF. One diagram for a slide show seems to show the path as connecting the instruction memory to the GIF, another diagram quite obviously shows the path going from the lower execution unit to the GIF, and yet another shows it with the path connecting the data memory to the GIF. This last one is the only one that makes sense to me, but I went ahead and left my diagram ambiguous.

As you'll recall from the discussion of VU0, VU0 is controlled by the CPU, and VU0 gets its instructions from whatever program the CPU is currently running. VU1, however, doesn't work that way. VU1's VIF plays a much more prominent role in VU1's life than VU0's VIF does in its. VU1's VIF takes in and parses what Sony confusingly calls a 3D display list. This 3D display list is not VU1's program. Rather, it's a data structure that contains two types of information, and some specialized commands that tell VU1 how to handle this information. The two types of info are

a. the actual VU1 program instructions, which go in VU1's instruction memory.
b. the data that said program operates on. This goes in VU1's data memory.
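
A hedged C sketch of what such a stream might look like. The command names and layout here are invented for illustration; Sony's actual VIFcode encoding differs, but the parsing idea is the same: some packets carry code, some carry data, and one says "go."

#include <stdint.h>

/* Invented packet types -- illustrative only, not the real VIFcode set. */
enum vif_cmd {
    VIF_UPLOAD_CODE,    /* payload goes to VU1 instruction memory */
    VIF_UPLOAD_DATA,    /* payload goes to VU1 data memory */
    VIF_START_PROGRAM   /* kick VU1 at the given entry point */
};

struct vif_packet {
    enum vif_cmd cmd;
    uint32_t addr;          /* destination (or entry point) in VU1 memory */
    uint32_t count;         /* number of 32-bit payload words that follow */
    const uint32_t *words;  /* the payload itself */
};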

The VIF decodes and parses the 3D display list commands, and makes sure that VU1 program code and data find their way into the correct spots. In this manner, VU1 can operate independently of the CPU to generate display lists. Executing these VU1 "VLIW mode" programs brings into play those three units that VU0 often neglects: the LSU, the iALU, and the RNG. These three units, along with the EFU (which acts as a general FPU), all function to make VU1 a full-blown SIMD/VLIW processor. Hahaha...there's that term again: SIMD/VLIW. Now it's time to find out what it means.


Programming the VU

To wrap up, we're going to take a look at the VU's instruction format, and talk about its microarchitecture in a bit more detail. We'll look at VU1 first, because it always runs in "VLIW mode." Then we'll talk about VU0, and how it's different.

Both VU1 and VU0 are 2-issue, VLIW processors. The basic instruction format for VU1 is a 64-bit, VLIW instruction packet (or "bundle," or whatever VLIW term you want to use) that can be divided into two, 32-bit COP2 instructions.

These two instructions are executed in parallel by two execution units: the upper execution unit and the lower execution unit. (Refer back to the two VU block diagrams on the last page. The upper unit is blue and the lower unit is green.) These two units have the following functionality:

Upper instructions:

  • 4 parallel FP ADD/SUB
  • 4 parallel FP MUL
  • 4 parallel FP MADD/MSUB
  • 4 parallel MAX/MIN
  • Outer product calculation
  • Clipping detection

Lower instructions:

  • FP DIV/SQRT/RSQRT
  • Load/Store 128b data
  • EFU (1 FMAC + 1 FDIV)
  • Jump/Branch
  • Random number generator/misc

Now, what you should note is that the upper execution unit is a SIMD unit, while the lower isn't. Hence the term "VLIW/SIMD." So the code is "64-bit VLIW," of which 32 bits are SIMD. Cool, eh? I thought so.

To see this in action, let's look at a code example that I've adapted from one of Sony's slides.

Upper Instruction             Lower Instruction
MUL  VF04, VF03, Q            DIV  Q, 1.0, VF02.w
MUL  ACC,  VF10, VF01.x       MOVE VF03, VF02
MADD ACC,  VF11, VF01.y       ADD  VI03, VI03, -1
MADD ACC,  VF12, VF01.z       NOP
MADD VF02, VF13, VF01.w       LQ   VF01, VI01++
NOP                           BGTZ VI03, LOOP
NOP                           SQ   VF04, VI02++

The instructions on the left are executed by the upper execution unit, while the ones on the right are executed by the lower unit.
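
If VU assembly isn't your native tongue, here's roughly what that loop computes, restated in C with the software pipelining unrolled (my reconstruction, not Sony's):

typedef struct { float x, y, z, w; } vec4f;

/* acc + col*s across all four lanes: one FMAC's worth of work. */
static vec4f madd(vec4f acc, vec4f col, float s)
{
    acc.x += col.x * s; acc.y += col.y * s;
    acc.z += col.z * s; acc.w += col.w * s;
    return acc;
}

void transform_vertices(const vec4f cols[4],  /* the vectors in VF10-VF13 */
                        const vec4f *in, vec4f *out, int n)
{
    for (int i = 0; i < n; i++) {
        vec4f v = in[i];                     /* LQ   VF01, VI01++       */
        vec4f t = {0.0f, 0.0f, 0.0f, 0.0f};
        t = madd(t, cols[0], v.x);           /* MUL  ACC,  VF10, VF01.x */
        t = madd(t, cols[1], v.y);           /* MADD ACC,  VF11, VF01.y */
        t = madd(t, cols[2], v.z);           /* MADD ACC,  VF12, VF01.z */
        t = madd(t, cols[3], v.w);           /* MADD VF02, VF13, VF01.w */
        float q = 1.0f / t.w;                /* DIV  Q, 1.0, VF02.w     */
        t.x *= q; t.y *= q; t.z *= q; t.w *= q;  /* MUL VF04, VF03, Q   */
        out[i] = t;                          /* SQ   VF04, VI02++       */
    }
}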

VU0 is a bit different from VU1 in that, instead of operating in VLIW mode all the time, it normally runs in "MIPS Coprocessor Mode." A MIPS Coprocessor instruction is a 32b instruction and not a 64-bit VLIW instruction. So this means that when it's in COP mode, VU0 can crunch 4, 32-bit FP SIMD numbers in parallel, using the 4 FMACs in the upper execution unit. (I'm assuming that in this situation, the upper opcode contains the SIMD FP instruction op and the lower opcode a NOP.)

VU0 doesn't have to stay in COP mode though. It can operate in VLIW mode by calling a micro-subroutine of VLIW code. In this case, it takes a 64-bit instruction bundle and splits it into two 32-bit MIPS COP2 instructions, and executes them in parallel, just like VU1.

As you can see, having two operating modes for VU0 is a bit complex, but it gives the unit a lot of flexibility.



Conclusion

As you can see from the above breakdown, with a total of 10 FMACs, 4 FDIVs, and all the other integer, branch, and load/store resources available, the Emotion Engine is a hoss.

Not only does the Emotion Engine have horsepower under the hood, but its aggressively new, cutting-edge design means that it's going to take a while for developers to really learn to use all that power. It'll be interesting to see if the PC has caught up with the PS2 by the time PS2 developers figure out how to exploit this hardware to its fullest potential.

Although I've stated repeatedly that the PS2's number one application is 3D gaming, neither Sony nor Toshiba (Toshiba designed the Emotion Engine, and Sony licenses it) is going to sit by and let this hardware get pigeonholed in that application space. Sony has invested big, big money (I think it's around $100 million) in developing non-game applications for the PS2. So by the time the PS2 goes stateside, we should see other types of software available for it. This device is going to be the centerpiece of Sony's assault on the world's living rooms, so you can bet they'll milk it for all they can.

Toshiba is also planning to leverage the Emotion Engine in other markets. I don't have any details, but I'd imagine that before too long we can expect to see a whole range of devices based on this chip. As far as its options in the embedded market go, it's not exactly the lowest-power device available. Here are some specs that should give you an idea of how it stacks up, process-wise, to other CPUs out there.

Clock 250 MHz
VDD 1.8v
Design Rule 0.25 um
Gate Length 0.18 um
Power 13 watts
Transistors 10.5 million
Die size 17 mm x 14.1 mm (240 mm2)
Package 540-pin PBGA (Ball Grid Array)
Layers 4-layer metal

Just to put things in perspective, 10.5 million transistors is the same number of transistors that the G4 has, with the K7 weighing in at about 22 million transistors. So while the Emotion Engine isn't exactly as svelte as Crusoe, it's pretty darn lean considering all the hardware that's packed onto it.

All in all, it should be a fascinating ride in the next few months as MS and Nintendo begin to ready their own console offerings. The PS2 has really upped the ante in terms of raw gaming horsepower, so MS and Nintendo are going to have to offer something killer in response. (Was I the only one who was unimpressed by the recently-released X-Box specs? I hope nVidia packs some amazing hardware into it, because after looking at the Emotion Engine, a 600MHz Intel offering ain't turnin' me on...maybe if it's a Willamette...) All speculation aside though, one thing is definitely for certain. As of the Japanese launch of the Playstation 2 last month, the home entertainment scene just got much, much more exciting.

Bibliography (and props)

Unfortunately, none of the articles on the PS2 that I used for my research are (legally) available free of charge. I think you can get most of them via the Ask IEEE service on the web though. Speaking of docs, I want to send a big whopping thank you to all the folks who responded to my recent cry for help. I've been poring over the docs in preparation for this article, and I haven't had time to write thank-you's yet. Emails will definitely be forthcoming though, because you guys made this article possible. Now, to the bibliography:

  • K. Kutaragi et al., "A Micro Processor with a 128b CPU, 10 Floating-Point MACs, 4 Floating-Point Dividers, and an MPEG2 Decoder," ISSCC (Int'l Solid-State Circuits Conf.) Digest of Tech. Papers, Feb. 1999, pp. 256-257.
  • F.M. Raam et al., "A High Bandwidth Superscalar Microprocessor for Multimedia Applications," ISSCC Digest of Tech. Papers, Feb. 1999, pp. 258-259.
  • A. Kunimatsu et al., "5.5 GFLOPS Vector Units for 'Emotion Synthesis'" (slide show and presentation), System ULSI Engineering Laboratory, Toshiba Corp. and Sony Computer Entertainment Inc.
  • Masaaki Oka and Masakazu Suzuoki, "Designing and Programming the Emotion Engine," Sony Computer Entertainment, IEEE Micro, pp. 20-28.
  • Various other slides from presentations that people mailed me. I have no idea where they came from, but if there was copyright info on them and I used a diagram from them, I included it.
  • Berkeley Design Technology, Inc., "DSP Processors and Cores -- the Options Multiply," reprinted from Integrated System Design, June 1995.
  • Berkeley Design Technology, Inc., "Choosing a DSP Processor."

==========================================

SONY COMPUTER ENTERTAINMENT ANNOUNCES WORLD’S FASTEST 128 Bit CPU "EMOTION ENGINE" FOR THE NEXT GENERATION PLAYSTATION

TOKYO, March 2, 1999 -- Sony Computer Entertainment Inc. is pleased to announce the co-development with Toshiba Corp. of the 128 bit CPU ("EE", or "Emotion Engine™") for use in the next generation of PlayStation®. In order to process massive multi-media information at the fastest possible speeds, the data bus, cache memory, and all registers are 128 bits wide; this is integrated on a single-chip LSI using state-of-the-art 0.18 micron process technology. The development of a full 128-bit CPU is the first of its kind in the world.

Not only will this new CPU have application for games, but it will be the core media processor for future digital entertainment applications, and has a vastly superior floating point calculation capability compared to the latest personal computers. The new CPU incorporates two 64-bit integer units (IU) with a 128-bit SIMD multi-media command unit, two independent floating point vector calculation units (VU0, VU1), an MPEG 2 decoder circuit (Image Processing Unit/IPU) and high performance DMA controllers onto one silicon chip. The massive combined performance of this CPU permits complicated physical calculation, NURBS curved surface generation and 3D geometric transformations, which are difficult to perform in real time with PC CPUs, to be performed at high speeds.

In addition, by processing the data at 128-bits on one chip, it is possible to process and transfer massive volumes of multi-media data. CPUs on conventional PCs have a basic data structure of 64 bits, with only 32 bits on recent game consoles. The main memory supporting the high speed CPU uses the Direct Rambus DRAM in two channels to achieve a 3.2GB/second bus bandwidth. This equates to four times the performance of the latest PCs that are built on the PC-100 architecture.

By incorporating the MPEG 2 decoder circuitry on one chip, it is now possible to simultaneously process high-resolution 3D graphics data at the same time as high quality DVD images. The combination of the two allows the introduction of a new approach to digital entertainment and real-time graphics and audio processing.

With a floating point calculation performance of 6.2 GFLOPS, the overall calculation performance of this new CPU matches that of a supercomputer. When this is applied to the processing of geometric and perspective transformations normally used in the calculation of 3D computer graphics (3DCG), the peak calculation performance reaches 66 million polygons per second. This performance is comparable with that of high-end graphics workstations (GWS) used in motion picture production.

Rambus is a registered trademark of Rambus Inc.

Emotion Engine Features and General Specifications

CPU core 128bit RISC (MIPS IV-subset)

Clock Frequency 300MHz

Integer Unit 64bit (2-way Superscalar)

Multimedia extended instructions 107 instructions at 128bit width

Integer General Purpose Register 32 at 128 bit width

TLB 48 double entries

Instruction Cache 16KB (2-way)

Data Cache 8KB (2-way)

Scratch Pad RAM 16KB (Dual port)


=================================

Main Memory 32MB (Direct RDRAM 2ch@800MHz)

Memory bandwidth 3.2GB/sec

DMA 10 channels

Co-processor1 FPU (FMAC x 1, FDIV x 1)

Co-processor2 VU0 (FMAC x 4, FDIV x 1)

Vector Processing Unit VU1 (FMAC x 5, FDIV x 2)

Floating Point Performance 6.2GFLOPS

Geometry

+ Perspective Transformation 66Million Polygons/sec

+ Lighting 38Million Polygons/sec

+ Fog 36Million Polygons/sec

Curved Surface Generation (Bezier) 16Million Polygons/sec

Image Processing Unit MPEG2 Macroblock Layer Decoder

Image Processing Performance 150Million Pixels/sec

Gate width 0.18 micron

VDD Voltage 1.8 V

Power Consumption 15 Watts

Metal Layers 4

Total Transistors 10.5 Million

Die Size 240 mm2

Package 540pin PBGA

*) 4-dimensional calculations on single-precision floating-point values.

EE performance based on measured data. For P2 and P3, theoretical maximum values based on manufacturer’s figures and other published data.

=================================

SONY COMPUTER ENTERTAINMENT ANNOUNCES THE DEVELOPMENT OF THE WORLD’S FASTEST GRAPHICS RENDERING PROCESSOR USING EMBEDDED DRAM TECHNOLOGY

TOKYO, March 2, 1999 -- Sony Computer Entertainment has developed the Graphics Synthesizer for the next generation PlayStation® incorporating a massively parallel rendering engine that contains a 2,560 bit wide data bus that is 20 times the size of leading PC-based graphics accelerators. Very high pixel fill rates and drawing performance are achieved only through the use of embedded DRAM process technology pioneered by SCE for use in advanced graphics technology.

The current PlayStation introduced the concept of the Graphics Synthesizer via the real-time calculation and rendering of a 3D object. This new GS rendering processor is the ultimate incarnation of this concept – delivering unrivalled graphics performance and capability. The rendering function was enhanced to generate image data that supports NTSC/PAL Television, High Definition Digital TV and VESA output standards. The quality of the resulting screen image is comparable to movie-quality 3D graphics in real time.

In the design of graphics systems, the rendering capability is defined by the memory bandwidth between the pixel engine and the video memory. Conventional systems use external VRAM reached via an off-chip bus that limits the total performance of the system. However, in the case of the new GS, there is a 48-gigabyte-per-second memory access bandwidth achieved via the integration of the pixel logic and the video memory on a single high performance chip. This allows orders of magnitude greater pixel fill-rate performance compared to today's best PC-based graphics accelerators.

When rendering small polygons, the peak drawing capacity is 75 Million polygons per second and the system can render 150 Million particles per second. With this large drawing capability, it is possible to render a movie-quality image. With Z-buffering, textures, lighting and alpha blending (transparency), a sustained rate of 20 Million polygons per second can be drawn continuously.

This new architecture can also execute recursive multi-pass rendering processing and filter operations at a very fast speed without the assistance of the main CPU or main bus access. In the past, this level of real-time performance was only achieved when using very expensive, high performance, dedicated graphics workstations. However, with the design of the new Graphics Synthesizer, this high quality image is now available for in-home computer entertainment applications. This will help accelerate the convergence of movies, music and computer technology into a new form of digital entertainment.

Graphics Synthesizer – Features and General Specifications

GS Core Parallel Rendering Processor with embedded DRAM

Clock Frequency 150 MHz

No. of Pixel Engines 16 (in Parallel)

Embedded DRAM 4 MB of multi-port DRAM (Synced at 150MHz)

Total Memory Bandwidth 48 Giga Bytes per Second

Combined Internal Data Bus width 2560 bit

+ Read 1024 bit

+ Write 1024 bit

+ Texture 512 bit

Display Color Depth 32 bit (RGBA: 8 bits each)

Z Buffering 32 bit

Rendering Functions Texture Mapping, Bump Mapping

Fogging, Alpha Blending

Bi- and Tri-Linear Filtering

MIPMAP, Anti-aliasing

Multi-pass Rendering

Rendering Performance

Pixel Fill Rate 2.4 Giga Pixels/sec (with Z buffer and alpha blend enabled)

1.2 Giga Pixels/sec (with Z buffer, alpha, and texture)

Particle Drawing Rate 150 Million /sec

Polygon Drawing Rate 75 Million /sec (small polygon)

50 Million /sec (48 Pixel quad with Z and A)

30 Million /sec (50 Pixel triangle with Z and A)

25 Million /sec (48 Pixel quad with Z, A and T)

Sprite Drawing Rate 18.75 Million /sec (8 x 8 Pixels)

Display output NTSC/PAL

Digital TV (DTV)

VESA (maximum 1280 x 1024 pixels)

Silicon process technology 0.25 µm, 4-level metal

Total number of transistors 43 Million

Die size 279mm2

Package Type: 384 pin BGA


===========================

SONY COMPUTER ENTERTAINMENT ANNOUNCES THE DEVELOPMENT OF AN I/O PROCESSOR FOR THE NEXT GENERATION PLAYSTATION® THAT PROVIDES 100% BACKWARDS COMPATIBILITY

TOKYO, March 2, 1999 -- Sony Computer Entertainment has developed the I/O Processor with LSI Logic Corporation for the next generation PlayStation. By embedding this processor we have achieved 100% backward compatibility with the current PlayStation. In addition, the new I/O Processor supports IEEE 1394 and Universal Serial Bus (USB) which are the new standards for digital interconnectivity.

The new I/O Processor for the next generation PlayStation is based on the current PlayStation CPU but with enhanced cache memory and a new, higher performance DMA architecture that permits a four-fold increase in data transfer rates. The serial interface is also upgraded to over 20 times the performance of the current PlayStation. In addition, the USB host controller and the IEEE 1394 link and physical layers are integrated onto this single chip LSI.

The USB interface is compatible with OHCI (Open Host Controller Interface) and can handle data transfer rates of between 1.5Mbps and 12Mbps (Mega bits per second). IEEE 1394 can handle data transfer rates of between 100 Mbps and 400 Mbps.

The use of these interfaces allows the future connectivity of the new PlayStation system to a variety of other systems and consumer products such as VCR, Set Top Box, Digital Camera, Printer, Joystick, Keyboard and Mouse amongst others.

SIMD architectures

By Jon Stokes | Published: March 21, 2000 - 07:00PM CT

Introduction

What do Sony's Playstation2 and Motorola's MPC7400 (a.k.a. the G4) have in common? Besides the incredible hype behind both products and their legions of crazed fans, there's one acronym that unites them all--an acronym that sums up the secret to their stellar performance: SIMD. Single Instruction stream, Multiple Data streams (SIMD) computing first entered the personal computing world in the form of Intel's neglected addition to the x86 instruction set, MMX. Even though MMX was panned by the press and was slow to be adopted, SIMD computing was here to stay on the personal computing landscape. And it's a good thing too, because SIMD is a technology whose time has definitely come, and it's just about ubiquitous on the desktop: MMX, SSE, 3DNow!, AltiVec, etc. are all acronyms for SIMD instruction sets. In this article, we're going to look at what SIMD is, what it offers, and how it's integrated in three-and-a-half of today's hottest processors. Three and a half? The half is Sun's upcoming MAJC architecture, which isn't actually out yet. We've included it here because its approach to SIMD is quite different from the other three, so it provides a nice contrast.

This article will provide a basic introduction to SIMD concepts, as well as an overview of the three and a half SIMD implementations under discussion. One thing that should definitely be understood is that this article is actually the sequel to my previous G4 vs. K7 tech article. If you want to look at AltiVec and 3DNow! in the context of both the G4 and K7 as a whole, then you must read the first article too. This article focuses in on SIMD, and ignores many of the important issues already taken up by its predecessor.

SIMD basics

Early microprocessors didn't actually have any floating-point capabilities; they were strictly integer crunchers. Floating-point calculations were done on separate, dedicated hardware, usually in the form of a math coprocessor. Before long though, transistor sizes shrunk to the point where it became feasible to put a floating-point unit directly onto the main CPU die, and the modern integer/floating-point microprocessor was born. Of course, the addition of floating-point hardware meant the addition of floating-point instructions. For the x86 world, this meant the introduction of the x87 floating-point architecture and its (now hopelessly archaic) stack-based register model.

So the x87 brought a new name, new capabilities, new registers, and new instructions to Intel's microprocessors. Sound familiar? It should.

Actually, the addition of SIMD instructions and hardware to a modern, superscalar CPU is a bit more drastic than the addition of floating-point capability. A microprocessor is a SISD device (Single Instruction stream, Single Data stream), and it has been since its inception.

As you can see from the above picture, a SIMD machine exploits a property of the data stream called data parallelism. You get data parallelism when you have a large mass of data of a uniform type that needs the same instruction performed on it. A classic example of data parallelism is inverting an RGB picture to produce its negative. You have to iterate through an array of uniform integer values (pixels), and perform the same operation (inversion) on each one -- multiple data points, a single operation. Modern, superscalar SISD machines exploit a property of the instruction stream called instruction-level parallelism (ILP). In a nutshell, this means that you execute multiple instructions at once on the same data stream. (See my other articles for more detailed discussions of ILP). So a SIMD machine is a different class of machine than a normal microprocessor. SIMD is about exploiting parallelism in the data stream, while superscalar SISD is about exploiting parallelism in the instruction stream.
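
Here's the pixel-inversion example from that paragraph in C. Written as a scalar loop it's pure SISD; the point is that every iteration applies the same operation to independent data, which is exactly the shape a SIMD machine exploits by doing 8 or 16 of these per instruction.

#include <stddef.h>
#include <stdint.h>

/* Invert an 8-bit-per-channel image to produce its negative. Same op,
 * independent data elements: textbook data parallelism. */
void invert_image(uint8_t *channels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        channels[i] = 255 - channels[i];
}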

There were some early, ill-fated attempts at making a purely SIMD machine (i.e., a SIMD-only machine). The problem with these attempts is that the SIMD model is simply not flexible enough to accommodate general purpose code. The only form in which SIMD is really feasible is as a part of a SISD host machine that can execute conditional instructions and other types of code that SIMD doesn't handle well. This is, in fact, the situation with SIMD in today's market. Programs are written for a SISD machine, and include SIMD instructions in their code.

One thing I'd like to note for the sake of all you nit-pickers out there, is that I'm going by the description of SISD as laid out in Hennessy and Patterson. A more detailed discussion of the finer points of SISD vs. SIMD as concepts, while it would be appropriate here, would hinder us from moving more quickly to the actual comparison of the SIMD implementations.



SIMD operations

The basic unit of SIMD love is the vector, which is why SIMD computing is also known as vector processing. A vector is nothing more than a row of individual numbers, or scalars.

A regular CPU operates on scalars, one at a time. (A superscalar CPU operates on multiple scalars at once, but each operation is specified by its own instruction.) A vector processor, on the other hand, lines up a whole row of these scalars, all of the same type, and operates on them as a unit.

These vectors are represented in what is called packed data format. Data are grouped into bytes (8 bits) or words (16 bits), and packed into a vector to be operated on. One of the biggest issues in designing a SIMD implementation is how many data elements it will be able to operate on in parallel. If you want to do single-precision (32-bit) floating-point calculations in parallel, then you can use a 4-element, 128-bit vector to do four-way single-precision floating-point, or you can use a 2-element 64-bit vector to do two-way SP FP. So the length of the individual vectors dictates how many elements of what type of data you can work with.
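
A C union makes the packed-data idea concrete: one 128-bit register, sliced whichever way the instruction dictates.

#include <stdint.h>

/* One 128-bit vector register, viewed at different element widths.
 * The width you pick determines how many ops one instruction performs. */
typedef union {
    uint8_t  b[16];   /* sixteen 8-bit elements  */
    uint16_t h[8];    /* eight  16-bit elements  */
    uint32_t w[4];    /* four   32-bit elements  */
    float    f[4];    /* four single-precision floats */
} vec128;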

Motorola's AltiVec literature divides the types of SIMD operations that AltiVec can do into four useful and easily comprehensible categories. These categories are a good way of dividing up the basic things you can do with vectors. Unfortunately for people who write SIMD comparison articles, both AMD's and Intel's tech docs categorize their hardware's SIMD operations in a completely different and less accessible way. (Actually, Intel's tech docs categorize things one way, and AMD's tech docs copy Intel's categorization. It's good to see that at least Motorola can think differently.) I'm going to use Motorola's categories, at least initially, for tutorial purposes. I'm also going to rob some of Motorola's pictures from their AltiVec literature, and modify them a bit.

I and II. Intra-element arithmetic and non-arithmetic functions

Intra-element arithmetic is one of the most basic and obvious types of SIMD operation. Consider an intra-element addition. This involves lining up two vectors (VA and VB), and adding their individual elements together to produce a sum vector (VT). The above picture shows an example of intra-element arithmetic at work. Intra-element operations also include multiplication, multiply-add, average, and min.

Intra-element non-arithmetic functions basically work the same as above, except that the operations performed are different. Intra-element non-arithmetic operations include AND, OR, and XOR.

Vector intra-element instructions

  • integer instructions
      - integer arithmetic instructions
      - integer compare instructions
      - integer rotate and shift instructions
  • floating-point instructions
      - floating-point arithmetic instructions
      - floating-point rounding and conversion instructions
      - floating-point compare instruction
      - floating-point estimate instructions
  • memory access instructions

III. Inter-element arithmetic

Inter-element operations are operations that happen between the elements in a single vector. An example of an inter-element arithmetic operation is shown above. This operation sums across the elements in a vector, and stores the result in an accumulation vector.

IV. Inter-element non-arithmetic

Inter-element non-arithmetic operations are operations like vector permute, which rearrange the order of the elements in an individual vector. We'll look at the permute operation a little closer in a later section.

Vector inter-element instructions

  • alignment support instructions
  • permutation and formatting instructions
      - pack instructions
      - unpack instructions
      - merge instructions
      - splat instructions
      - permute instructions
      - shift left/right instructions


Saturated arithmetic

One feature that all the SIMD implementations under discussion share is support for saturated arithmetic. With wrap-around arithmetic, whenever you do a calculation whose result turns out to be bigger than what you can represent with whatever data format you're using (16-bit, 32-bit, etc.), the CPU stores a wrap-around number in the destination register and sets some sort of overflow flag to tell the program that the value exceeded its limits. This isn't really ideal for media applications though. If you add two 32-bit color pixel values together and get a number that's greater than what you can represent with 32 bits, you just want the result to come out as the maximum representable value (#FFFFFF, or white). You don't really care that the number was too big; you just want to represent the extreme value. It's sort of like turning up the volume on an amplifier past 10 (Spinal Tap jokes aside). You can keep on turning that knob, but the amp is already maxed out--which is what you want.
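
The difference is easy to show in C for a single 8-bit channel value:

#include <stdint.h>
#include <stdio.h>

static uint8_t add_wrap(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);                  /* wraps modulo 256 */
}

static uint8_t add_sat(uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);  /* pegs at the maximum */
}

int main(void)
{
    printf("wrap: %u\n", (unsigned)add_wrap(200, 100)); /* 44: wrapped past zero */
    printf("sat:  %u\n", (unsigned)add_sat(200, 100));  /* 255: stays maxed out  */
    return 0;
}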

AltiVec

I'll start my description of SIMD implementations with AltiVec, because of its simplicity and straightforward design. Even though Intel's and AMD's SIMD implementations came before AltiVec chronologically, I'll use AltiVec as the norm and treat the other two as deviations. I do this mainly for didactic purposes; it makes the material easier to understand.

Unlike AMD and Intel, Motorola took a dedicated hardware approach to SIMD. They added 32 new AltiVec registers to the G4's die along with two dedicated AltiVec SIMD functional units, thus increasing the die size of the G4. Nevertheless, the G4's die is still under 1/3 the size of the PIII's, which is itself about half the size of the Athlon's. Since the G3 was so small to begin with (in comparison to Intel's and AMD's offerings), Motorola could afford to spend the extra transistors adding dedicated SIMD hardware.

All of the AltiVec calculations are done by one of two fully-pipelined, independent AltiVec execution units. The first unit is the Vector Permute Unit. It handles vector operations that involve rearranging the order of the elements in a vector. These are those inter-element operations, like pack, unpack, and permute. It also handles vector memory accesses -- the loading and storing of vectors into the registers.

The second piece of hardware is the Vector ALU. This unit handles all of the vector arithmetic (integer and FP multiply, add, etc.) and logic (AND, OR, XOR, etc.) operations. Most of these fall under the heading of intra-element operations, where you're combining two vectors (and possibly a control vector) to get a result.

Both of these execution units are fully pipelined and independent. This means that the G4 can execute two 128-bit vector operations per cycle (one ALU, one permute), and it can do so in parallel with regular floating-point operations and integer operations. The units are also pretty fast. The instruction latency is 1 cycle for simple operations, and 3-4 cycles for more complex ones.

As I noted above, AltiVec has 32 architectural SIMD registers. This is a lot of registers, and they really give the compiler freedom to schedule operations and manage register usage for maximum efficiency. Each register is 128 bits wide, which means that AltiVec can operate on vectors that are 128 bits wide. AltiVec's 128-bit wide vectors can be subdivided into

  • 16 elements, where each element is either an 8-bit signed or unsigned integer, or an 8-bit character.
  • 8 elements, where each element is a 16-bit signed or unsigned integer
  • 4 elements, where each element is a either a 32-bit signed or unsigned integer, or a single precision (32-bit) IEEE floating-point number.

That last bullet point is especially important to note. The ability to grind through vectors of four, single precision floating-point numbers every cycle is impressive, and represents a key advantage of AltiVec.

Instructions

AltiVec adds 162 new instructions to the G4's instruction set. The AltiVec instruction format is especially nice, as it allows you to use 4 distinct registers to do your computations: two source registers to hold the operands, 1 filter/modifier register, and 1 destination register to hold the result.

The diagram above shows a basic, intra-element operation. VA and VB are the source registers, and VC is a filter/modifier register that can hold masks, or otherwise modify a computation. VT is the destination register.

The filter/mod register adds a lot of flexibility, especially when you're doing something like a vector permute.

In the picture above, VA and VB contain the two vectors to be permuted, and VC contains the control vector that tells AltiVec which elements it should put where.
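
In C terms, the permute works like this. This is a generic restatement of AltiVec's vperm semantics: each control byte's low five bits select one byte from the 32-byte concatenation of VA and VB.

#include <stdint.h>

/* vperm-style byte permute: vc picks bytes from the concatenation
 * of va (indices 0-15) and vb (indices 16-31). */
void vec_permute(const uint8_t va[16], const uint8_t vb[16],
                 const uint8_t vc[16], uint8_t vt[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = vc[i] & 0x1f;
        vt[i] = (sel < 16) ? va[sel] : vb[sel - 16];
    }
}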

Interrupts

Another important advantage of AltiVec that deserves to be pointed out is that there are no interrupts except on vector LOADs and STOREs. You have to have interrupts for LOADs and STOREs in case of, for instance, a cache miss. If AltiVec tries to LOAD some data from the L1 cache into a register, and that data isn't there, it throws an interrupt (stops executing) so that it can wait for the data to arrive.

AltiVec doesn't, however, have interrupts for things like overflows and such (remember the saturated arithmetic discussion). Furthermore, the peculiar implementation that 3DNow! and SSE use to do 128-bit single-precision FP means that a 128-bit fp calculation can throw an interrupt, saturation or no. More on that when we talk about SSE, though.

The upshot of all this is that AltiVec can keep up its single-cycle throughput as long as the L1 keeps the data stream full. The FP and integer instructions aren't going to hold up execution by throwing an interrupt.


The Story of MMX

The story of MMX and SSE/KNI/MMX2 is quite a bit more complicated than AltiVec's. There are a number of reasons why this is so. To begin with, Intel introduced MMX first as an integer-only SIMD solution. MMX doesn't support floating-point arithmetic at all. Even as MMX was being rolled out, Intel knew that they had to include FP support at some point. An article in an issue of the Intel Technology Journal tells this story:

In February 1996, the product definition team at Intel presented Intel's executive staff with a proposal for a single-instruction-multiple-data (SIMD) floating point model as an extension to IA-32 architecture. In other words, the "Katmai" processor, later to be externally named the Pentium III processor, was being proposed. The meeting was inconclusive. At that time, the Pentium® processor with MMX instructions had not been introduced and hence was unproven in the market. Here the executive staff were being asked essentially to "double down" their bets on MMX instructions and then on SIMD floating point extensions. Intel's executive staff gave the product team additional questions to answer and two weeks later, still in February 1996, they gave the OK for the "Katmai" processor project. During the later definition phase, the technology focus was refined beyond 3D to include other application areas such as audio, video, speech recognition and even server application performance. In February 1999, the Pentium III processor was introduced.

Another complicating factor for MMX is the fact that Intel jumped through some hoops to avoid adding a new processor state, hoops that complicated the implementation of MMX. I'll deal with this in more detail shortly.

Instead of discussing MMX and SSE together, I'll first discuss MMX alone. This will lay the groundwork for the discussion of both SSE and 3DNow!, since they're both expansions of MMX, and competitors to boot.

The elements

I'll save a discussion of MMX's implementation on the PIII and its attendant problems for the section on SSE. For now, let's consider some basic features of MMX as an instruction set. Where AltiVec's vectors are 128 bits wide, MMX's are only 64 bits wide. These 64-bit vectors can be subdivided into

  • 8 elements (a packed byte), where each element is an 8-bit integer,
  • 4 elements (a packed word), where each element is a 16-bit signed or unsigned integer, or
  • 2 elements (packed double word), where each element is a 32-bit signed or unsigned integer.

These vectors are stored in 8 MMX registers, organized as a flat register file. These 8 registers, MM0-MM7, are aliased onto the x87's stack-based floating-point registers, FP0-FP7. Intel did this in order to avoid imposing a costly state switch any time you want to use MMX instructions. The drawback to this approach is that floating-point operations and MMX operations must share a register space, so a programmer can't mix floating-point and MMX instructions in the same routine. Of course, since there's no mode bit for MMX or FP, there's nothing to prevent a programmer from pulling such a stunt and corrupting his floating-point data.

The fact that you can't mix floating-point and MMX instructions normally isn't a problem, though. In most programs, floating-point calculations are used for generating data, while SIMD calculations are used for displaying it.
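
Here's what a basic MMX routine looks like, as a hedged sketch using the intrinsics from mmintrin.h (the function name is mine, and it assumes 8-byte-aligned arrays). The important line is the last one: because MM0-MM7 alias the x87 stack, you have to issue EMMS before going back to floating-point code:

    #include <mmintrin.h>

    /* Add two packed-word vectors: four parallel 16-bit adds. */
    void add_words(const short *a, const short *b, short *out)
    {
        __m64 va = *(const __m64 *)a;
        __m64 vb = *(const __m64 *)b;
        *(__m64 *)out = _mm_add_pi16(va, vb);  /* PADDW */
        _mm_empty();  /* EMMS: hand the aliased registers back to x87 */
    }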

In all, MMX added 57 new instructions to the x86 ISA. The MMX instruction format is pretty much like the conventional x86 instruction format:

MMX Instruction mmreg1, mmreg2

In the above instruction, mmreg1 is both the destination and a source operand, meaning that mmreg1 gets overwritten by the result of the calculation.

For obvious reasons, this two-operand format isn't nearly as flexible as AltiVec's. If you perform an MMX op and then immediately afterwards need one of the source vectors again, you're SOL. Either you made a backup copy of it in another register beforehand, or you've got to reload it; both options take extra time and hinder performance.

Another thing that makes the MMX instruction format less cool is that MMX operations lack that third filter/mod vector that AltiVec has. This means that you just can't do those one-cycle, arbitrary two-vector permutes, as illustrated below. Oh well...
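
To see the difference, consider rearranging data in MMX. Instead of one permute driven by an arbitrary control vector, you get fixed shuffle primitives like the unpack instructions, and anything fancier takes a multi-instruction sequence. A hedged example:

    #include <mmintrin.h>

    /* PUNPCKLBW's one hardwired pattern: a0,b0,a1,b1,a2,b2,a3,b3.
       Other byte orders mean chaining several such ops. */
    __m64 interleave_low_bytes(__m64 a, __m64 b)
    {
        return _mm_unpacklo_pi8(a, b);
    }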



Sequel: MMX2/SSE/KNI

If you thought that MMX was hobbled by backwards compatibility issues, wait until you get a load of SSE (the ISA extension formerly known as MMX2/KNI). Intel's goal with SSE was to add four-way, 128-bit SIMD single-precision floating-point computation to the x86 ISA. Did they succeed? Well, sorta.

With the PIII, Intel went halfway on adding dedicated hardware to the CPU die. The PIII has two, fully-pipelined, independent, SIMD, single-precision floating-point units (that's a mouthful). However, for SIMD floating-point multiplication, they expanded on the existing FP multiplier hardware. So SIMD FP multiplications share an execution unit with regular FP multiplications. The PIII does, in fact, have a dedicated SIMD FP adder, which is independent of the regular floating-point hardware.

As far as the registers go, Intel went ahead and added an extra eight 128-bit registers for holding SIMD floating-point data. These eight are in addition to the 8 MMX/x87 registers that were already there. Since these registers are totally new and independent, Intel had to hold their nose and add an extra processor state to accommodate them. This means a state switch if you want to go from using x87 to MMX or SSE. It also means that OS code had to be rewritten to accommodate the new state.

Now, when I said that Intel "sorta" succeeded in adding four-way, 128-bit SIMD FP to the x86 ISA, I meant that the way the PIII handles it is kind of a hack. See, a 4-way FP SSE instruction gets broken down into two, 2-way (64-bit) microinstructions. These instructions are then executed either in parallel or sequentially by the two SIMD units. "Wait a minute," you object. "Doesn't one SIMD unit do addition and the other do multiplication?" Yeah, that's the case. So what this means for sustained 128-bit computation with 1 op/cycle throughput is that you can only get it from paired floating-point multiply and add instructions. These pairs show up in dot product calculations, so they're pretty common. Still, it's not as cool as being able to do just any 128-bit vector calculation you like at 1 op/cycle.

Intel made this decision for a number of reasons. First and foremost, they wanted to conserve die space. As it is, MMX/SSE adds 10% to the size of the PIII's die. If they had gone ahead and implemented an independent SIMD multiplication unit, this percentage would have been higher. So they reused some FP hardware to keep the transistor count low. Furthermore, doing things this way allows them to use the existing 64-bit data paths inside the CPU to do 128-bit computation. Adding dedicated SIMD floating-point hardware and a 128-bit internal data path to push vectors down would have really eaten up transistor resources. Intel was also able to limit the changes to the PIII's instruction decoder by implementing 128-bit SIMD FP in this manner. Finally, because the SIMD adder and multiplier are independent and sit on two different ports, the PIII can dispatch a 64-bit add and a 64-bit multiply at the same time.
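
In code, that favored pattern is just a 4-way multiply feeding a 4-way add, as in this hedged, dot-product-style sketch using the SSE intrinsics from xmmintrin.h (names mine):

    #include <xmmintrin.h>

    /* One accumulation step: acc += a * b, four floats at a time.
       The MULPS and ADDPS can occupy both SIMD units simultaneously. */
    __m128 fmadd(__m128 acc, __m128 a, __m128 b)
    {
        return _mm_add_ps(acc, _mm_mul_ps(a, b));
    }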

Remember that all of this 128-bit talk applies only to floating-point instructions. Old-school integer MMX calculations are still restricted to the world of 64 bits. Such is the price of backwards compatibility.

Interrupts

By breaking up the 128-bit ops into two 64-bit uops ("uop" = microinstruction) and running them either concurrently or sequentially, the PIII opens itself up to the possibility that one of the uops will encounter a snag and have to bail out ("throw an exception") after the other one has already been retired. If this were to happen, then only half of the 128-bit destination register would hold a valid result. Oops.

To prevent this, the PIII includes special hardware in the form of a Check Next Micro-Operation (CNU) mechanism. What this hardware does is keep the first uop from retiring if the second one throws an exception. This means that once the re-order buffer (which keeps track of execution and retirement) gets the first, completed uop of a 128-bit instruction, it has to wait up for the second uop to finish before it can retire them both. This has the potential to slow things down.

Intel got around this by taking advantage of a common case in multimedia processing. Often, as in the case of saturated arithmetic, exceptions like overflow and underflow are masked, meaning the programmer has told the processor to just ignore them. If an MMX or SSE instruction has its exceptions masked, the PIII would ignore any exception anyway, so it doesn't bother having the re-order buffer (ROB) wait up for the second uop. In that case, the ROB can go ahead and retire each uop individually. This is much faster, and since it's the common case, it reduces the impact of exception handling on performance.
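
For reference, that masking is under programmer control via bits in SSE's MXCSR register. A minimal sketch using the xmmintrin.h helper macros (masking everything, which is also the power-on default):

    #include <xmmintrin.h>

    /* Mask all six SSE exception types; overflow, underflow, and
       friends then produce IEEE default results instead of trapping. */
    void mask_all_sse_exceptions(void)
    {
        _MM_SET_EXCEPTION_MASK(_MM_MASK_MASK);
    }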

3DNow! and Advanced 3DNow!

AMD's 3DNow!, as it's implemented on the Athlon, faces problems similar to those faced by SSE. Since 3DNow!, like SSE, incorporates MMX in all its 64-bit, x87 register-sharing glory, it has to deal with all the less desirable features of MMX (64-bit-only integer computation, the two-operand instruction format, etc.). 3DNow! takes the 57 MMX instructions and adds 21 unique instructions that handle floating-point arithmetic, for a total of 78 instructions. The Athlon's Advanced 3DNow! adds another 24 new, SSE-like instructions (for DSP, cache hinting, etc.), bringing the SIMD instruction count up to 102.

3DNow! simulates four-way single-precision (128-bit) FP computation the same way that the PIII does: by breaking 4-way operations down into a pair of 2-way microinstructions and executing them in parallel on two different SIMD execution units. Like the PIII's, the two units are independent of each other, and one does addition while the other does multiplication. This means that for 3DNow! to sustain 128-bit computation, it has to issue a 2-way single-precision multiply and a 2-way single-precision add in parallel. However, unlike either the PIII or AltiVec, 3DNow! has no 128-bit instructions, so any "128-bit SIMD computation" that it does is purely the result of using two 64-bit instructions in parallel. Another big difference between the Athlon's SIMD implementation and the PIII's is that the Athlon has two independent, fully-pipelined floating-point functional units, and both of them do double duty as SIMD FUs. (Recall that the PIII has two FPUs that aren't fully pipelined, and only one of them does double duty as a SIMD FU.)

The final important difference between SSE and 3DNow! is the fact that all the 3DNow! operations, both integer and floating-point, share the same 8 registers with MMX and x87 operations. There are no added registers, as on the PIII. This is good and bad. It's good in that you can switch between 3DNow! and MMX instructions without a costly change of state. It's bad insofar as eight registers is very few for the compiler to be working with. (The Athlon has a load of internal, microarchitectural 3DNow!/MMX/FP registers, so it can use register renaming to help alleviate some of the register starvation. The PIII also has microarchitectural rename registers for this purpose, but the Athlon has more of them.)
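
Here's what 2-way 3DNow! floating-point looks like from C, as a hedged sketch using mm3dnow.h (gcc with -m3dnow; the function name is mine). Note that a "4-way" add really is two 2-way instructions, and that FEMMS, AMD's faster EMMS, is still required because the registers alias x87:

    #include <mm3dnow.h>

    /* Add two 8-byte-aligned arrays of four floats via two PFADDs. */
    void add4(const float *a, const float *b, float *out)
    {
        const __m64 *va = (const __m64 *)a;
        const __m64 *vb = (const __m64 *)b;
        __m64 *vo = (__m64 *)out;
        vo[0] = _m_pfadd(va[0], vb[0]);  /* elements 0 and 1 */
        vo[1] = _m_pfadd(va[1], vb[1]);  /* elements 2 and 3 */
        _m_femms();                      /* fast EMMS: restore x87 state */
    }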



MAJC

I want to touch very briefly on Sun's upcoming MAJC architecture, because it handles SIMD in a completely different way than any of the above CPUs. Sun didn't take an existing CPU design and add SIMD capabilities to it in the form of a dedicated SIMD unit. What they did instead was integrate SIMD support seamlessly into the design of the processor itself.

If you read my article on Sun's MAJC, then you're familiar with the fact that it's a 4-wide VLIW processor. The cool thing about MAJC is that all four of its functional units are data-type agnostic. There are no integer units or floating-point units. (More technically, there are no integer or fp logical pipes, but there is, in fact, dedicated hardware at the end of each pipe.) Any FU can handle any type of data, which makes for a number of interesting possibilities. (According to Sun's docs, one of the FUs is an "extended subset" of the other three. So they're not all totally identical, but nearly so.) One of the things you can do with this is feed the same instruction stream to all four FUs while feeding a different data stream to each one, yielding 4-way, 128-bit integer or floating-point SIMD!

Sun has already stated that this is how they'll do SIMD with MAJC. I'll be interested to see some more details on the types of instructions that they'll implement.

Conclusions

There is so much more that can be said about these three SIMD implementations. It would be nice to be able to include a detailed breakdown/comparison of the instructions each one offers, in order to get a more nuanced understanding of the functionality each affords. Also, I haven't even talked much about instruction latencies or throughput for any of these architectures; both of these factors greatly impact performance. Finally, each instruction set includes special instructions for manipulating the data stream, manipulating the cache hierarchy, loading and storing vectors, etc. Including these factors in the discussion would almost double the length of this article! A follow-up that deals more in-depth with these critical issues is in order.

Nevertheless, this overview has covered some of the basic SIMD concepts and implementation issues relevant to the current personal computing market. Also, as I said at the outset, this article is meant to be read in conjunction with my previous architectural comparison. Right now, I think it's clear that AltiVec's SIMD implementation is the cleanest, most extensible, and most powerful of the current lot. However, Intel is including the sequel to SSE, SSE2, in Willamette, so we'll have to see what kind of advancements it brings with it. Also, while it might not be immediately relevant for the PC market, Sony's Playstation2 makes heavy use of 128-bit SIMD calculations. When (or if) I get my hands on some tech docs for that, I'll be sure to do a write-up on it.

Bibliography

I have to take a moment to thank Walter Nisticò for his help with this article. His feedback was invaluable in helping me clarify some of the differences between the implementations. I also want to thank Chris Rijk over at Ace's for updating me on Sun's MAJC SIMD implementation. Now, on to the bibliography...

The following white papers are available from Motorola's website

  • Motorola's Altivec Technology, Sam Fuller
  • AltiVec Technology (PPC-C1-AltiVec-990920.pdf)

The following white papers are available from AMD

  • AMD Athlon Processor Technical Brief
  • AMD Extensions to the 3DNow! and MMX Instruction Sets Manual
  • 3DNow! Technology Manual

The PlayStation2 vs. the PC: a system-level comparison of two 3D platforms

By Jon Stokes | Published: April 15, 2000 - 10:00PM CT

Introduction

When I was in the research phase for my recent technical article on the Emotion Engine, I thought for a while there that I was never going to figure out how the PS2 worked. The PS2 is such a bizarre and powerful beast that it took me many hours of poring over articles and slide presentations just to get my bearings with it. Well, it seems I'm not alone in my struggle to understand the capabilities and limitations of one of the most painfully innovative pieces of hardware to hit the consumer market in quite a long time. I got some good feedback from PS2 developers who are also having a hard time with the new hardware. In fact, to get an idea of just how bad the situation is, check out this article on MSNBC, which discusses the difficulties that even the most experienced of console programmers are having in learning how to code for the PS2. When the programmers responsible for some of the greatest console games ever made say that the PS2's learning curve is steep, you know something's up.

In this article, I want to try and get a handle on some of the aspects of the PS2 that make it so fundamentally different from the current crop of PCs, and in the process shed some light on the difficulties developers will face in going from the PC to the PS2. Specifically, I'll look at the overall designs of the PS2 and PC, and explain how the demands of dynamic media processing have caused the design of the PS2 architecture to differ from the PC's to the point where developers will have to rethink how they move code and data around to render a 3D environment.

New wine and old bottles: Dynamic media vs. static applications

To kick off the discussion, I'm going to use a very prescient article published in 1996 by Keith Diefendorff and Pradeep K. Dubey entitled, "How Multimedia Workloads Will Change Processor Design." In this article, Diefendorff and Dubey argue that the processing of dynamic media will result in fundamental changes in processor design, and they detail what some of those changes will look like. In parts, it sounds as if they're looking ahead into the future and describing the PS2. Since this article describes the current situation so well, I'll be drawing on it to frame the discussion.

Diefendorff et al. start out by distinguishing media applications from more traditional applications by noting that media apps are examples of what they call dynamic processing. What this basically means is that the instruction stream doesn't really change all that fast, but the data stream changes constantly. Or, put more concretely, programs like 3D games and scientific applications deal with very large amounts of data, but the groups of instructions that operate on these large data chunks are usually very small. The most common situation is where you have a small loop that iterates through a large matrix or series of matrices many, many times.
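
In code, that pattern is as simple as it sounds. Here's a hedged miniature (the names are mine): a loop whose handful of instructions fits easily in the instruction cache while it streams over far more data than any data cache could hold:

    /* Scale n vertices (x,y,z triples): tiny code, huge data stream. */
    void scale_vertices(float *xyz, long n, float s)
    {
        for (long i = 0; i < 3 * n; i++)
            xyz[i] *= s;
    }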

Contrast this to a more traditional, static processing application like a word processor, which uses many different segments of code (menus, wizards, spell checkers, etc.) to operate on a single data file (the document). In this type of application, the data stream is pretty stagnant, and doesn't change very much. The instruction stream, however, is all over the map. One minute you're firing up the spell checker to process the file, the next minute you're changing the fonts, and then when that's done, maybe you export it to a different format like Postscript or HTML.

The PC was designed with just such static processing applications in mind: spreadsheets, word processors, and the like. In recent years, however, it has undergone some significant changes, particularly with the addition of special-purpose hardware like 3D accelerators and sound processors. Such added hardware, though, represents an attempt to put new wine in an old bottle. At some point, Diefendorff and Dubey predict, all media processing functionality will be integrated on a single die, and specialized DSP processing hardware will become obsolete. Furthermore, such designs will feature extremely wide data paths between the on-die components. Enter the PS2...


Caching, Bandwidth, and 3D rendering

This change in the nature of the instruction and data streams has significant implications for system design. One of the first things to be affected is the cache. Consider the following illustration of caching for a static application.

In the above diagram, you'll notice that there's a steady flow of instructions through the instruction cache, resulting in a high turnover rate. Instructions don't stay in the cache long before getting booted out by the next instruction that the machine needs. The data cache, on the other hand, can collect up the most commonly used pieces of data and just hold onto them while all those instructions in the instruction stream process them. You can just drop a whole piece of data in there, like a spreadsheet or a document, and leave it there while you run a steady stream of instructions over it to modify it. In such a situation, we say that the data exhibits high locality of reference, whereas the instructions exhibit lower locality of reference.

In contrast, a dynamic media app has the opposite cache usage behavior. And in fact, the problem is exacerbated by the fact that a media app pushes data through the data cache much faster than a static app pushes code through the instruction cache.

So while static apps can sometimes make poor use of the instruction cache, media apps almost always make extremely poor use of the data cache. There's just too much data to be processed in too short a time for any of it to sit in a cache; the cache acts more as a brief stopping-off point for data than as a real cache. However, media apps have excellent locality of reference when it comes to instructions. Most of the data that moves through the cache is processed by loops and other very small bits of code, which are often small enough to fit in the instruction cache and stay there. So the instructions just hang out in the cache and monotonously grind away at all that data that's flying by them.

The other major difference between static apps and dynamic apps is their bandwidth needs. Since a static app can drop all its instructions and data into a cache without worrying too much about needing to fetch more anytime soon, systems designed for such applications feature large caches connected by relatively low-bandwidth pipes. Dynamic apps, on the other hand, can make do with smaller caches, but since they transfer so much data, they need much more bandwidth between those caches.
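
As an aside, the PC world has begun to acknowledge this with instructions that treat the cache as exactly that kind of stopping-off point. One hedged example: SSE's streaming store, _mm_stream_ps, writes results around the cache so that streamed output doesn't evict data that actually has locality:

    #include <xmmintrin.h>

    /* Copy n floats (n a multiple of 4, both pointers 16-byte aligned)
       without polluting the caches with the output stream. */
    void copy_stream(const float *src, float *dst, long n)
    {
        for (long i = 0; i < n; i += 4)
            _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    }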

Here's a goofy example to help you visualize what I'm talking about: imagine a series of large buckets, connected by pipes to a main tank, with a cow lapping water out of each bucket. Since cows don't drink too fast, the pipes don't have to be too large to keep the buckets full and the cows happy. Now imagine that same setup, except with elephants on the other end instead of cows. The elephants are sucking water out so fast that you've got to do something drastic to keep them happy. One option would be to enlarge the pipes just a little (*cough* AGP *cough*), and stick insanely large buckets on the ends of them (*cough* 64MB GeForce *cough*). You then fill the buckets up to the top every morning, leave the water on all day, and pray to God that the elephants don't get too thirsty. This only works to a certain extent though, because a really thirsty elephant would still end up draining the bucket faster than you can fill it. And what happens when the elephants have kids, and the kids are even thirstier? You're only delaying the inevitable with this solution, because the problem isn't with the buckets, it's with the pipes (assuming an infinite supply of water). A better approach would be to just ditch the buckets altogether and make the pipes really, really large. You'd also want to stick some pans on the ends of the pipes as a place to collect the water before it gets consumed, but the pans don't have to be that big because the water isn't staying in them very long.


3D and caching on the PC

The above analogy, silly though it seems, sums up one of the primary differences between the design of the PS2 and that of a PC. 3D games are just the sort of dynamic media apps that the PC wasn't designed to cope with--they're the elephants in the analogy. PC game programmers operate in a world of small pipes and large buckets, and they design games to fit that paradigm. Let's take a look at the architecture of the PC, particularly the caches and connections between them, from a 3D programmer's perspective.

Note that I didn't include the caches on the accelerator's core itself; I only included the VRAM. I did this for a number of reasons. First and foremost, it doesn't really impact our discussion much--you'll see why once we get to the PS2. Second, detailed information on the innards of most 3D chips is kind of hard to come by. Finally, this article is trying to give a more general overview, without getting into too much detail. So just assume that there's cache hanging around inside the accelerator that I'm ignoring.

Let's now go step-by-step through the process of producing a couple of frames of 3D on the above, non-hardware T&L PC system. We'll assume that the application code and the data all fit in main memory.

Geometry Stage

The first stage of the process is called the geometry stage. This is where a description of the object and its position in the 3D world are created. This description, called a display list, is a sequence of commands, parameters, and other data that can be further processed by the next stage. In order to create the display list, the CPU has to first get the 3D engine code and the data out of main memory and load it into the L1 and L2 caches to be worked on. So the L1 and L2 caches act as sort of a workspace where the CPU can keep code and data in order to do tessellation, setup, transformation, and lighting. These display lists eventually get written back out to main memory before going on to the next stage. In the geometry stage, the FSB and the main memory bus see the most traffic, because all that data is being shuffled back and forth between the CPU caches and RAM.


Rendering Stage

While the geometry stage is producing one frame, the rendering stage is producing the bitmap of the next frame. This bitmap is a 2D, pixel-by-pixel representation of the 3D scene, which will eventually be drawn on the screen as a frame. The actual rendering of the 2D bitmap from the 3D scene is handled by the video card, so this means that the display lists produced by the CPU have to make their way out of main memory, across the AGP bus, and into the video card's video memory. Here, the vid card builds the bitmaps in the frame buffers by executing the display lists. The display lists tell the vid card to draw triangles and lines, do coloring and shading, apply texture maps, etc. The application of texture maps means moving the textures out of main memory and into the video memory, which even further stresses the AGP and main memory buses.

Display Stage

Finally, in the display stage, the next frame is being painted onto the screen. This painting involves fetching data from the frame buffer and converting that digital pixel data into a stream of analog signals that the monitor can understand. This stage makes heavy use of the video memory and the video memory bus.


You should be able to tell from the above, very general description just how closely the 3D rendering process is fitted to the PC's architecture. To further nuance the description, let's look at the overall division of labor for the rendering process on a standard PC. First up, here's a table that gives you a very general idea of how much cache is available to the rendering pipeline.

Pentium III system with 32MB TNT2 Ultra

  • L1 cache (instruction, data): 16K 4-way + 16K 4-way
  • L2 cache: 256K unified
  • Video memory: 32MB
  • Total: 33,056K

Now let's see how that cache is divided up geographically and functionally.

As we'll soon see, the above division of labor doesn't quite work for the PS2. In fact, that's a major understatement, so let me rephrase: it doesn't work at all. The PS2 requires you to rethink how you divide up the labor between the stages of the rendering pipeline, and how you move code and data around between those stages.


Caching on the PS2

First, let's take a look at the PS2's bus and cache layout.

What I've tried to represent in the above figure is that the caches are much smaller than those on the PC, but the buses connecting them are much wider. Again, I'm sure that if I had access to the details of the Graphics Synth's architecture, I'd almost certainly find that it has cache inside of it, too, in various places.

In place of the PC's north bridge, the PS2 has a 10-channel Direct Memory Access Controller (DMAC) that coordinates data transfers between the units and caches on the Emotion Engine's 128-bit and 64-bit internal data paths. The PS2 also uses two 128-Mbit (16MB) RDRAM banks for its main memory, each of which is connected to the EE's on-die DMAC by a high-speed 16-bit bus. RDRAM has the virtue of being an extremely high-bandwidth memory solution, so it can keep that 10-channel DMAC (which can manage 10 simultaneous bus transfers) busy and those internal caches fed.

Speaking of the PS2's internal caches, let's look at their sizes, and how they stack up to a PIII's caches, especially with respect to the first few stages of the rendering pipeline.

Playstation 2

  • L1 cache (instruction, data): 16K 2-way instruction + 8K 2-way data
  • SPRAM (scratchpad): 16K
  • VU0 cache (instruction + data): 16K + 16K = 32K
  • VU1 cache (instruction + data): 16K + 16K = 32K
  • Video memory: 4MB
  • Total: 4,200K

As you can tell from the above chart, the PIII system outlined earlier has almost 8 times the amount of cache in its rendering pipeline as the PS2 does. Even if we take the video memory and the cache on the accelerator card completely out of the equation, the PC still has about 3 times the cache of the PS2. Furthermore, what little cache the PS2 has is divided up among a larger number of small caches. Remember the zoo analogy: large buckets and small pipes vs. small pans and large pipes.

The PS2's approach is causing developers to rethink how they move data inside the machine. In a comment in the /. thread about my PS2 article, one ex-PS2 developer noted that the VU caches are too small to store a whole model or 32-bit texture, so programmers were pulling their hair out trying to figure out how to deal with the size limitation. He pointed out that one group that had had PS2 development units for a while took the strategy of constantly streaming textures and models into the vector units and the CPU, instead of downloading them once, caching them, and working on them inside the cache. This approach was running the 10-channel DMAC at 90% capacity! This kind of aggressive use of bandwidth resources is exactly the kind of thing PS2 developers will have to do. Between the RAMBUS memory banks, the 10-channel DMAC and the 128-bit internal data bus, the PS2 has bandwidth to burn--what it doesn't have is internal cache. Currently, developers are thinking in terms of 3D cards with large on-board memory that can cache large models and textures, and modestly sized L1 and L2 caches for storing code and data.

The PS2 is the exact opposite, though. There's memory-to-processor bandwidth out the wazoo. The RIMMs are the cache, and the available bandwidth is such that you can get away with storing everything there and downloading it on the fly. So with the PS2, code and data have to be constantly streamed over the wide internal buses in order to arrive at the functional units right when they're needed. Of course, the trick then is scheduling the memory transfers so that you always have what you need on hand and latency doesn't kill you. I'm not so sure how developers will tackle this, but it'll be interesting to see what techniques they'll use. I'm sure the PS2 has some sophisticated prefetching hardware that's not discussed in any of the documentation I have.
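
If I had to guess, the general shape of the solution is classic double-buffering: while the functional units chew on one chunk sitting in a small on-chip buffer, the DMAC streams the next chunk into the other. Here's a purely illustrative sketch; dma_start, dma_wait, and process are hypothetical stand-ins, not Sony's actual API:

    #include <stddef.h>

    #define CHUNK 4096  /* sized to fit comfortably in a 16K scratchpad */

    static float buf[2][CHUNK / sizeof(float)];

    /* Hypothetical placeholders for the platform's DMA kernel calls. */
    void dma_start(void *dst, const char *src, size_t bytes);
    void dma_wait(void);
    void process(float *chunk, size_t count);

    void stream_all(const char *src, size_t total)
    {
        dma_start(buf[0], src, CHUNK);            /* prime chunk 0 */
        for (size_t i = 0; (i + 1) * CHUNK <= total; i++) {
            dma_wait();                           /* chunk i has landed */
            if ((i + 2) * CHUNK <= total)         /* overlap: fetch i+1 */
                dma_start(buf[(i + 1) & 1], src + (i + 1) * CHUNK, CHUNK);
            process(buf[i & 1], CHUNK / sizeof(float));
        }
    }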



SIMD on the PS2 and conclusions

Diefendorff and Dubey point out one more important way that dynamic media processing will affect system design. They note:

"Input data streams are frequently large collections of small data elements such as pixels, vertices, or frequency/amplitude values. The parallelism in these streams is fine grained. And because elements of these large input data streams tend to undergo identical processing (filtering, transformations, and so on), it lends itself to machines with SIMD hardware units operating in parallelÖfor media processing, simple SIMD execution units with wide data paths would be able to achieve significant speedups without this enormous complexity." (p.2)

What they're saying is that media applications exhibit a very high degree of data parallelism, much more so than static apps. Static applications, on the other hand, exhibit varying degrees of instruction-level parallelism, but little data parallelism. Here's a diagram that shows a static application at work.

In contrast, dynamic media processing exhibits very little instruction-level parallelism and massive amounts of data parallelism. What instruction stream parallelism a dynamic app does have usually happens at the thread level, and not at the individual instruction level. As a result, such apps lend themselves to SIMD processing.

The PS2 has two dedicated 128-bit SIMD floating-point vector units, VU0 and VU1, each of which is able to process massive amounts of data per clock cycle. In addition, the MIPS III CPU core on the Emotion Engine can do 128-bit integer SIMD by locking together its two 64-bit integer pipes. So the PS2 fits the profile of having multiple parallel SIMD units connected by high-bandwidth pipes.

Again, the issue is how to make use of these resources. PC programmers aren't used to having access to that kind of raw data-processing power. In fact, from what I've heard, very few developers are using both vector units.

We're beginning to see vector processing take off on the PC, and I have no doubt that it will be very prevalent in the next few years. However, until the PC is able to overcome its bandwidth bottlenecks it won't be able to keep its SIMD units fed as well as the PS2.

Conclusions

So which machine is more powerful? Well, if you're talking 3D gaming and you mean right now, I wouldn't hesitate to give the crown to the PS2. Looking ahead to the next two or three years, the future looks much less certain. It'll be quite a while before developers are able to figure out how to harness the full capabilities of the PS2, and while they're scratching their heads the PC will be getting more and more powerful. As we'll soon see with the NV15, the PC is still advancing quickly under the old "large buckets and slightly bigger pipes" paradigm. However, the PS2 represents the true next generation of media processing, and until the PC catches up with it in terms of bandwidth and overall data throughput (read "SIMD"), it can't worthily be called a true dynamic media machine. That being said, a look at the PS2 is a look into the (probably near) future of the PC. The data pipes will indeed get wider, SIMD will increase the amount of media data a PC can process, and the PC will come to resemble more and more the kind of media machine that Diefendorff and Dubey described, and that the engineers at Sony and Mitsubishi built.

Bibliography

  • Keith Diefendorff and Pradeep K. Dubey, "How Multimedia Workloads Will Change Processor Design." Computer, September 1997
  • Stephen P. VanderWiel and David J. Lilja. "When Caches Aren't Enough: Data Prefetching Techniques." Computer
  • Bruce Shriver, Bennett Smith. The Anatomy of a High-Performance Microprocessor: A Systems Perspective. Los Alamitos, CA: IEEE Computing Society Press, 1998
  • Sound and Vision: A Technical Overview of the Emotion Engine, Jon Stokes.