Memory Considerations
Almost all 3D graphics hardware features a frame buffer, a depth buffer and of course a texture buffer. A stencil buffer may also be present. All of these buffers take up your graphics card's memory. Since that memory is limited, we can only use so much of it. Most of the time, the frame buffer (taking into account double or triple buffering) and the depth buffer take first priority. The frame buffer holds the final output, the one you'll see on your screen, so it is written to constantly. The depth buffer is also read and written frequently. Whatever memory the graphics card has left is used for the texture buffer.
If your graphics card's memory is quite limited, there might not be enough space to store all the textures (because most of it is taken by the frame and depth buffers). This will cause performance issues, more so if you're playing a game that uses lots of textures. Large, highly detailed textures may force the graphics card to store some of them in main memory (RAM), which is a lot slower than the local graphics card memory. Texture compression may help, but not by much, so we have to try to fit all textures inside the graphics card memory. This was once a primary concern, but since graphics cards nowadays come with lots of memory (the largest available today is 512 MB), capacity is no longer an issue.
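As a rough illustration of how the buffers share the card's memory, here is a minimal Python sketch. The function names, the 128 MB card size, the double-buffered 1024 x 768 display and the 24-bit depth buffer are assumptions picked for the example; a stencil buffer and any driver overhead are ignored.

```python
MIB = 1024 * 1024  # bytes in one binary megabyte


def buffer_bytes(width, height, bits_per_pixel):
    """Size in bytes of one full-screen buffer."""
    return width * height * bits_per_pixel // 8


def texture_memory_left(card_mib, width, height,
                        color_bits=32, depth_bits=24, frame_buffers=2):
    """Bytes left for textures after the frame and depth buffers are reserved.

    frame_buffers=2 models double buffering, 3 models triple buffering.
    """
    color = frame_buffers * buffer_bytes(width, height, color_bits)
    depth = buffer_bytes(width, height, depth_bits)
    return card_mib * MIB - color - depth


left = texture_memory_left(128, 1024, 768)
print(f"{left / MIB:.2f} MiB left for textures")
```

Raising the resolution or switching to triple buffering shrinks the remainder, which is the trade-off described above.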
What's more important now is bandwidth. If you remember, bandwidth is influenced by two things: memory speed in Hz and bus width. The wider the bus and/or the faster the memory, the higher the bandwidth. Even if you can fit all the textures inside the graphics card memory, you still need lots of bandwidth to actually transfer the textures back and forth to the chip. This is one of the main reasons why low-end / entry-level graphics cards are better suited to resolutions lower than 1024 x 768 at 32-bit color. Both the capacity (128 MB) and bandwidth (64 or 128 bit bus) of these cards can't keep up with the demands of a higher resolution, even more so if you're planning to play with lots of effects and features such as anti-aliasing and anisotropic filtering.
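The relationship between memory speed, bus width and bandwidth can be sketched in a few lines of Python. This is only a peak, theoretical figure; the function name and the sample clock values are assumptions chosen for illustration, and "effective clock" here means the number of transfers per second (DDR memory transfers twice per clock).

```python
def memory_bandwidth_gb_s(effective_clock_mhz, bus_width_bits):
    """Peak bandwidth in GB/s (decimal): transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_clock_mhz * 1e6 * bytes_per_transfer / 1e9


# Same memory speed, different bus widths: doubling the bus doubles the bandwidth.
for bus in (64, 128, 256):
    print(f"{bus}-bit bus at 1000 MHz effective: "
          f"{memory_bandwidth_gb_s(1000, bus):.1f} GB/s")
```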
To illustrate what's going on, let's calculate some numbers. We know that one frame at 1024 x 768, 32-bit color takes up about 3 MB. If we're planning to hit a constant frame rate of about 60 fps, roughly speaking we will need 60 frames per second x 3 MB, which is 180 MB/s. Don't forget there's the depth buffer: that's another 135 MB/s (60 x 2.25 MB, the size of a 24-bit depth buffer at this resolution). Of course, we also need bandwidth for textures. Most graphics cards today can hold and process a texture up to 2048 x 2048, 32-bit color, roughly 16 MB in size. Textures used for some game models are limited to 512 x 512, 32-bit color, so 16 such textures fit in the space of one 2048 x 2048 texture. We also have to factor filtering into the equation, which can use anywhere from 4 samples (bilinear) to 16 or even 32 samples. At 32 samples, that's 30 GB/s (16 MB x 60 frames per second x 32 samples) of bandwidth for textures alone, and about 31 GB/s in all! With Z and texture compression, we can actually get away with less bandwidth, although there's some additional latency.
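The arithmetic above can be reproduced with a short Python sketch. The helper name is made up for this example, the "MB" figures are treated as binary megabytes (which is what makes a 1024 x 768, 32-bit frame come out at exactly 3), and the 24-bit depth buffer is an assumption that happens to match the 135 MB/s figure.

```python
MIB = 1024 * 1024  # the article's "MB" figures match binary megabytes


def traffic_mib_per_s(width, height, bits_per_pixel, fps, samples=1):
    """MiB moved per second if the whole surface is touched `samples` times per frame."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * samples * fps / MIB


frame = traffic_mib_per_s(1024, 768, 32, fps=60)                  # frame buffer
depth = traffic_mib_per_s(1024, 768, 24, fps=60)                  # assumed 24-bit depth
texture = traffic_mib_per_s(2048, 2048, 32, fps=60, samples=32)   # 32 filter samples
total = frame + depth + texture

print(f"frame {frame:.0f}, depth {depth:.0f}, "
      f"texture {texture:.0f}, total {total:.0f} MiB/s")
```

The texture term dominates by two orders of magnitude, which is why the sample count is the first thing to cut when bandwidth is tight.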
A GeForce 6800 has around 33.6 GB/s of bandwidth, so it's enough for our example. Compare that to the 'measly' 14.4 to 16 GB/s of bandwidth of its sibling, the GeForce 6600. If you don't remember, the main difference between the two is the bus width (256 vs 128 bit); the memory type and clock are about the same. So, if we want to use the same settings on the GeForce 6600, we have to compromise a bit (even after using Z and texture compression). For example, one compromise we can make is to use a maximum of 16 samples for filtering, which is good enough for most people. This means we only need 15.36 GB/s of bandwidth (16 MB x 60 frames per second x 16 samples), just right for the GeForce 6600.
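To see how the sample count trades off against available bandwidth, here is a small Python sketch that picks the largest sample count fitting a given budget. The function names and the list of candidate sample counts are assumptions; the sketch counts texture traffic only (no frame or depth buffer traffic, no compression) and uses the same 16 MB x 60 x samples estimate as the text, divided by 1000 to get GB/s.

```python
def texture_bandwidth_gb_s(texture_mb, fps, samples):
    """Texture traffic estimate used in the text: size x frame rate x samples."""
    return texture_mb * fps * samples / 1000


def max_samples(budget_gb_s, texture_mb=16, fps=60, choices=(32, 16, 8, 4, 1)):
    """Largest candidate sample count whose texture traffic fits the budget."""
    for samples in choices:
        if texture_bandwidth_gb_s(texture_mb, fps, samples) <= budget_gb_s:
            return samples
    return None  # even a single sample per texel does not fit


for name, budget in (("GeForce 6800", 33.6), ("GeForce 6600", 16.0)):
    print(f"{name}: up to {max_samples(budget)} samples within {budget} GB/s")
```

With the lower 14.4 GB/s figure quoted for the GeForce 6600, the same estimate would fall back to 8 samples.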
Rasterization Performance Considerations
Even all the bandwidth and memory in the world won't help performance if your graphics card is slow to begin with. Just as with your main processor, we need a balance between available bandwidth and the processing power of the graphics chip. We usually measure that processing power by how many pixels or texels the chip can process per cycle (or per clock). The more pixels we can draw per cycle, the faster we can fill the entire frame buffer.
Texel and pixel processing are usually done inside the graphics chip. Just like your processor, the chip is made up of smaller units, some of which are called TMUs (Texture Mapping Units). Additional units let the chip split the workload instead of using a single unit to process everything. After the texels are processed, they can be passed to dedicated filtering units or filtered internally if the TMUs are capable of doing so. Before shaders arrived, texel processing was very easy, so these units were quite simple and small. With better fabrication and design, we can put more than one TMU inside a graphics chip; a second TMU effectively doubles the pixel processing power.
TMUs have also become faster and more powerful. Arranging TMUs in an array (a 'pipeline') allows us to process either one pixel with additional filtering for free, or two texels with an additional pass for filtering. Under this scenario, multitexturing can be done in a single pass. With 4 'pipelines', we can get either 4 single-textured pixels with trilinear filtering or 4 multitextured pixels with bilinear filtering. Nowadays, it's not that rare to see graphics cards which can process 8, 12 or 16 texels/pixels per cycle.
When full-fledged shader hardware arrived, these TMUs were effectively replaced by more generalized, more powerful ALUs (Arithmetic Logic Units). They can still function just like basic TMUs, but they offer more flexibility because they can process instructions (shaders) as well. Unlike multitexturing, which involves only simple, limited processing, shaders can be short or long and may take several cycles to complete. So, to maximize performance, these ALUs are designed to process several instructions at a time. A simpler way of putting it: if an instruction needs 3 cycles to complete, then processing 3 of them at once means each one effectively costs a single cycle. We also get the advantage of doing 3 short instructions in a single cycle! To get even more out of these ALUs, they can 'pack' several pieces of data together in one large chunk. So, instead of processing four 32-bit texels one at a time, they can put them together as one 128-bit chunk. This method, often called SIMD (Single Instruction Multiple Data), has been used by processors for a long time (MMX, 3DNow!, SSE, AltiVec).
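The packing idea can be imitated in plain Python, which has arbitrary-size integers. This is only a toy model of SIMD, not how any particular chip implements it: the function names are invented, and the texel layout (alpha in the top 8 bits of each 32-bit value) is an assumption. The point is that one operation on the 128-bit chunk touches all four texels at once.

```python
def pack_texels(texels):
    """Pack four 32-bit texels into one 128-bit integer, first texel in the low bits."""
    if len(texels) != 4:
        raise ValueError("expected exactly four texels")
    chunk = 0
    for i, texel in enumerate(texels):
        chunk |= (texel & 0xFFFFFFFF) << (32 * i)
    return chunk


def unpack_texels(chunk):
    """Split a 128-bit chunk back into four 32-bit texels."""
    return [(chunk >> (32 * i)) & 0xFFFFFFFF for i in range(4)]


# One 128-bit mask that keeps the lower 24 bits of every texel.
ALPHA_CLEAR = pack_texels([0x00FFFFFF] * 4)


def clear_alpha_packed(chunk):
    """A single AND clears the (assumed) alpha byte of all four texels."""
    return chunk & ALPHA_CLEAR


texels = [0xFF112233, 0x80445566, 0x00778899, 0x7FAABBCC]
result = unpack_texels(clear_alpha_packed(pack_texels(texels)))
print([f"{t:08X}" for t in result])
```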


Now, let's go back to our previous example, the GeForce 6800. The chip on this graphics card can process 12 or 16 texels per clock. Since it's running at 400 MHz, that gives us about 4.8 to 6.4 billion texels per second. Remember that a texel (texture element) is a pixel in the texture. Since the graphics card can process that many texels, it needs 19.2 to 25.6 GB/s (4.8 to 6.4 billion x 32 bits) of bandwidth for the textures. The 33.6 GB/s that it has is more than enough (with some additional headroom for other tasks).
On the other hand, the GeForce 6600 can only process 8 texels per clock. It's running at 500 MHz, providing us with 4 billion texels per second. The needed bandwidth should be around 16 GB/s, which a 128-bit bus with 500 MHz DDR memory (effectively 1 GHz) can supply. This is preferred over a 256-bit bus with 250 MHz DDR memory, because a graphics card with a 256-bit bus is more expensive to make. The memory for both cards is basically the same, so manufacturers don't have to place separate orders, and the two product lines can share memory stock: if demand is high for the 6600, the memory allocated for the 6800 can be used first, and vice versa.
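The fill-rate figures for both cards follow from the same two multiplications, sketched here in Python. The function names are made up for the example, and the calculation assumes every texel is a 32-bit value fetched from memory exactly once, with no caching or compression.

```python
def texels_per_second(texels_per_clock, core_clock_mhz):
    """Peak texel rate: texels per clock x clocks per second."""
    return texels_per_clock * core_clock_mhz * 1e6


def texture_bandwidth_gb_s(texel_rate, bits_per_texel=32):
    """Bandwidth needed to feed that texel rate, in GB/s (decimal)."""
    return texel_rate * bits_per_texel / 8 / 1e9


cards = (
    ("GeForce 6800, 12 texels/clock", 12, 400),
    ("GeForce 6800, 16 texels/clock", 16, 400),
    ("GeForce 6600, 8 texels/clock", 8, 500),
)
for name, per_clock, mhz in cards:
    rate = texels_per_second(per_clock, mhz)
    print(f"{name}: {rate / 1e9:.1f} billion texels/s, "
          f"{texture_bandwidth_gb_s(rate):.1f} GB/s")
```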
The GeForce 6800 uses four ALUs that can process up to four texels each, and they can process shaders as well. Several instructions can be done in a single cycle, since each ALU can process two instructions per cycle (with some restrictions). More complex ones might require several cycles. Under the traditional 'pipeline' assumption, the GeForce 6800's ALUs basically allow us to calculate either 1 anisotropic filtered pixel with 16 samples, 2 trilinear filtered pixels or 4 bilinear filtered pixels. The GeForce 6600 is equipped with a different, although similar, ALU. It uses a single ALU capable of processing eight texels per cycle, but optimized to produce four pixels per cycle. This arrangement, much like the old TMU array, allows it to process some instructions (for shaders) or perform filtering / multitexturing more effectively. One clear advantage is that we get either 1 relatively free trilinear filtered pixel or 1 anisotropic filtered pixel with up to 16 samples (in two passes) with almost no performance drop. That's very good for a card with half the pixel processing performance of the GeForce 6800.
Confused? To put it simply, the processing power of any graphics card is very much related to how many texels it can process per cycle. The general consensus is that the more TMUs (or ALUs), the better. But the way the TMUs are organized, and what they are capable of, also influences performance. Remember that on the screen, we're viewing at least bilinear filtered textures, not just point sampled ones. If we 'stack' TMUs into an array, we get more multitexturing power and possibly trilinear filtering for free. Anisotropic filtering does need another pass, but we can get additional samples for free with the second TMU. Combined with enough speed and memory bandwidth, we can get relatively free anisotropic filtering! Without stacked TMUs, we must do another pass to filter each texel, and even more passes if we're going to do trilinear and of course anisotropic filtering. But a high enough clock and the sheer number of texels we can produce per cycle alleviate the penalty for multiple passes.
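To make the difference between point sampling and bilinear filtering concrete, here is a minimal software sketch in Python. It is not how a TMU is built; it only shows why a bilinear lookup costs four texel reads where a point sample costs one. The texture is assumed to be a small grid of grayscale values, coordinates are in texel units, and edges are clamped.

```python
def point_sample(texture, x, y):
    """Nearest texel: one texture read."""
    height, width = len(texture), len(texture[0])
    xi = min(max(int(x + 0.5), 0), width - 1)
    yi = min(max(int(y + 0.5), 0), height - 1)
    return texture[yi][xi]


def bilinear_sample(texture, x, y):
    """Weighted blend of the four nearest texels: four texture reads."""
    height, width = len(texture), len(texture[0])
    x = min(max(x, 0.0), width - 1)    # clamp to the texture edges
    y = min(max(y, 0.0), height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0            # fractional position between texels
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 1.0],
           [2.0, 3.0]]
print(point_sample(texture, 0.2, 0.2), bilinear_sample(texture, 0.5, 0.5))
```

Trilinear filtering repeats the bilinear lookup on two mipmap levels and blends the results, which is where the extra samples per pixel come from.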