Pushing the Frame Rate Envelope
Last month, ATI finally launched its Radeon X1900 series, formerly known as R580. From a feature standpoint, it's not much different from the very late-to-market Radeon X1800 series. However, there is one distinctive difference between the two: the number of pixel shader units. Whereas the Radeon X1800 has 16, the Radeon X1900 comes equipped with 48. The reason: to push shaders as fast as possible.
ATI's decision to go all out on shaders was not made without thought. Newer games rely more and more on shaders instead of textures. But that's not the only reason. Textures not only take up space but are very heavy on bandwidth as well. The burden can be too much for the current generation of graphics cards, even high-end monsters like the GeForce 7800GTX or Radeon X1900 XT. GDDR3 memory clocks are already near their upper limits, and the same applies to the memory controllers used by these cards. This year's and next year's games will arrive with more detail, be it through textures, shaders or a mixture of both.
Beyond3D recently interviewed Eric Demers about ATI's design decisions for the Radeon X1900. The interview raises some interesting questions and facts that could help predict where graphics hardware is heading in the next year or so. Of particular interest to us is the decision to add more pixel shader units while keeping the number of ROP units the same. NVIDIA, on the other hand, seems poised to extend the G70 (GeForce 7 series) architecture further by adding more G70-like quads.
So, we decided to play around with a GeForce 7800GTX some more. This time, we took it for a spin to see the influence of quads, core clock and memory clock on current games. This article started out as something to do on the weekend, but it quickly became more and more interesting (at least to us) as testing progressed.
Getting Ready
From the current crop of games, there are two that we feel represent texture dependent and shader heavy games perfectly: Quake 4 and F.E.A.R. Despite having lower hardware requirements, F.E.A.R is definitely the heavier of the two. Quake 4, on the other hand, is still (loosely) based on the Doom 3 engine, which in turn uses shaders from the pre-SM 3.0 era. Both games look amazing in full detail, and gamers and reviewers alike use them extensively when making graphics card purchase decisions.
There's something else both games share. Even with a GeForce 7800GTX or a Radeon X1800XT, you won't get comfortably playable frame rates with AA and AF above 1024 x 768. Now that's heavy. Short of overclocking the card, the only way to get faster frame rates is to use the multi-GPU solutions from ATI and NVIDIA, CrossFire and SLI respectively. But as we pointed out in our last article, this is not always the best solution.
Now that we have the games, we must choose how to enable and disable quads on our GeForce 7800GTX sample and change its core and memory clocks. Since we don't want to do any hard modding (it is still a sample after all), we decided to rely on RivaTuner by Alexey Nikolaychuk, a.k.a. Unwinder. Some might ask why we didn't just use the GeForce 7800GT and the newly released GeForce 7800GS. Well, this is not meant to be a review of those cards. We wanted to see what kind of influence one or two quads have on performance, at both different and identical clocks.
Performance
The test setup is pretty much the same as in our last articles. AA and AF settings were applied in game, with image settings set to High Quality in the driver panel and no optimizations enabled. Quake 4 was set to High Quality, while bilinear filtering was set with the command 'image_filter GL_LINEAR_MIPMAP_NEAREST'. 4x AA was set with the command 'r_multisamples 4' and AF was set with 'image_anisotropy n', with n being the number of samples. F.E.A.R was run at our usual settings: all details set to maximum, with Soft Shadows off. System related settings (physics etc.) were set to minimum.
We'd like to thank both Tagan and Kingston for supplying the additional power supply and 1 GB memory modules for this article.
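For readers who want to reproduce the Quake 4 settings, the console commands quoted above can be generated per test mode. This is only a sketch: the three command names come from the text, while the trilinear filter name, the default values of 1 and 0, and the choice of base filter under AF are our assumptions, and console_commands is a hypothetical helper.

```python
# Hypothetical helper that builds the Quake 4 console lines for each test mode.
# Only the three command names are taken from the article; the trilinear
# filter value and the base filter used together with AF are assumptions.

BILINEAR = "GL_LINEAR_MIPMAP_NEAREST"   # value quoted in the article
TRILINEAR = "GL_LINEAR_MIPMAP_LINEAR"   # assumed counterpart for trilinear


def console_commands(image_filter, anisotropy=1, multisamples=0):
    """Return the console lines for one filtering / AA combination."""
    return [
        f"image_filter {image_filter}",
        f"image_anisotropy {anisotropy}",
        f"r_multisamples {multisamples}",
    ]


modes = {
    "Bilinear": console_commands(BILINEAR),
    "Trilinear": console_commands(TRILINEAR),
    "16x AF": console_commands(TRILINEAR, anisotropy=16),
    "4x AA Bilinear": console_commands(BILINEAR, multisamples=4),
    "4x AA 16x AF": console_commands(TRILINEAR, anisotropy=16, multisamples=4),
}

for name, lines in modes.items():
    print(f"{name}: " + "; ".join(lines))
```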
Our test setup
AMD Athlon 64 3500+ socket 939
2 x 1024 MB Kingston KVR 3-3-3 PC3200 DDR-SDRAM
MSI K8N NForce 4 SLI motherboard
ASUS GeForce 7800GTX DDR3 256 MB graphics card
(running at standard clocks - core 486 MHz / memory 675 MHz (1350 MHz effective))
Maxtor DiamondMaxPlus9 80 GB Serial ATA 8 MB buffer
ASUS E-616 DVD-ROM
Tagan TG530-U15 530 watts ATX/BTX power supply
Windows XP Professional with Service Pack 2 installed
NVIDIA Forceware 81.98 reference driver
NVIDIA NForce 4 6.70 reference driver
Creative SoundBlaster Live! 24 bit 5.12.1.512 driver.
DirectX 9.0c
Performance
The results:
Quake 4
[Charts: Quake 4 - Data Processing Plant, 1024 x 768, with 6, 5 and 4 PS quads. Settings shown: Bilinear, Trilinear, 2x AF, 4x AF, 8x AF, 16x AF, 4x AA Bilinear, 4x AA 16x AF]
One glance at the results and you can see they are mostly system limited. Only with AA and AF enabled did we see a significant difference in frame rates. Because of this, we will only be looking at AA and AF results from Quake 4. Now, let's look closely at the results with 4, 5 and 6 PS quads. You can use the table below to see the numbers more clearly.
| 6 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 69 | 69 | 68 | 68 | 68 | 69 | 70 | 64 |
| Avg | 122.17 | 122.09 | 122.74 | 121.75 | 121.56 | 121.51 | 105.6 | 97.29 |
| Max | 165 | 159 | 168 | 165 | 162 | 165 | 167 | 153 |
| 5 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 69 | 69 | 70 | 69 | 69 | 68 | 71 | 62 |
| Avg | 123.25 | 123.02 | 123.17 | 122.58 | 122.06 | 121.33 | 104.63 | 94.2 |
| Max | 168 | 170 | 162 | 168 | 163 | 164 | 170 | 144 |
| 4 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 67 | 68 | 69 | 69 | 69 | 69 | 68 | 59 |
| Avg | 122.01 | 121.99 | 122.01 | 122.04 | 121.11 | 120.57 | 103.01 | 90.08 |
| Max | 170 | 167 | 161 | 170 | 171 | 162 | 169 | 138 |
Of course, on a more powerful system the results will be slightly different. Focus on the AA and AF results, where we're not as system limited as with just bilinear or even 16x anisotropic filtering. 'Pure' AA performance is pretty much the same with 4, 5 or 6 quads; the difference is around 2 - 3 fps overall. So, having more quads won't give you higher frame rates in Quake 4 (or in games that depend mostly on textures), at least not in pure AA with bilinear filtering. However, it seems that having more quads does help performance with AA and AF enabled at the same time. The difference is not noticeable in gameplay, but it's interesting nonetheless. Let's see just how much performance we gain with each quad.
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| 5 to 6 Quad |
| Min | 0.00% | 0.00% | -2.86% | -1.45% | -1.45% | 1.47% | -1.41% | 3.23% |
| Avg | -0.88% | -0.76% | -0.35% | -0.68% | -0.41% | 0.15% | 0.92% | 3.28% |
| Max | -1.79% | -6.47% | 3.70% | -1.79% | -0.61% | 0.61% | -1.76% | 6.25% |
| 4 to 5 Quad |
| Min | 2.99% | 1.47% | 1.45% | 0.00% | 0.00% | -1.45% | 4.41% | 5.08% |
| Avg | 1.02% | 0.84% | 0.95% | 0.45% | 0.79% | 0.63% | 1.57% | 4.57% |
| Max | -1.18% | 1.80% | 0.62% | -1.18% | -4.68% | 1.23% | 0.59% | 4.35% |
| Per Quad (Average) |
| Min | 1.49% | 0.74% | -0.70% | -0.72% | -0.72% | 0.01% | 1.50% | 4.16% |
| Avg | 0.07% | 0.04% | 0.30% | -0.12% | 0.19% | 0.39% | 1.25% | 3.92% |
| Max | -1.48% | -2.34% | 2.16% | -1.48% | -2.65% | 0.92% | -0.59% | 5.30% |
Here we can see that the move from 4 to 5 quads provides the highest increase in performance, both with AA and AF enabled and with 'pure' AA, although a gain of around 1.5 percent with pure AA is far from significant. While 6 quads are faster still, the increase is actually smaller - probably due to limitations in memory bandwidth (which we'll look into later). What's interesting is that the AA and AF gains can be up to three times the 'pure' AA gains. This confirms our previous suspicion that having more quads has little to do with the GeForce 7's (and probably the GeForce 6's) performance with AA alone, but it does help with AA and AF together. You can say that with each quad, performance under AA and AF gets closer to 'pure' AA levels - giving us almost 'free' anisotropic filtering.
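The per-quad percentages can be recomputed from the average frame rates in the first table. The sketch below uses only the Avg values for the two AA columns; the function and variable names are ours, and small differences in the last digit against the table come down to rounding.

```python
# Average fps from the Quake 4 table, keyed by number of active quads.
avg_fps = {
    "4xAA 16xAF": {4: 90.08, 5: 94.20, 6: 97.29},
    "4xAA Bilinear": {4: 103.01, 5: 104.63, 6: 105.60},
}


def gain_percent(before, after):
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def per_quad_gains(fps_by_quads):
    """Gain for each added quad, plus the mean gain per quad."""
    quads = sorted(fps_by_quads)
    steps = {
        f"{lo} to {hi}": gain_percent(fps_by_quads[lo], fps_by_quads[hi])
        for lo, hi in zip(quads, quads[1:])
    }
    steps["per quad"] = sum(steps.values()) / len(steps)
    return steps


for setting, fps in avg_fps.items():
    gains = per_quad_gains(fps)
    summary = ", ".join(f"{step}: {value:.2f}%" for step, value in gains.items())
    print(f"{setting} -> {summary}")
```

Dividing the two 'per quad' figures gives a ratio of a little over three, which is where the 'up to three times' observation comes from.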
OK, let's see the numbers on the first table in a different light.
| 6 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | -1.45% | 7.25% | 0.00% | 8.70% |
| Difference (Avg) | 13.56% | 19.93% | 0.54% | 6.37% |
| Difference (Max) | -1.21% | 7.27% | 0.00% | 8.48% |
| 5 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | -2.90% | 8.82% | 1.45% | 11.72% |
| Difference (Avg) | 15.11% | 22.36% | 1.56% | 7.25% |
| Difference (Max) | -1.19% | 12.20% | 2.38% | 13.39% |
| 4 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | -1.49% | 14.49% | -2.99% | 15.99% |
| Difference (Avg) | 15.57% | 25.28% | 1.18% | 9.72% |
| Difference (Max) | 0.59% | 14.81% | 4.71% | 14.23% |
We think this table sums it up pretty well. Running at the same core and memory clocks, having more quads will (only) help the GeForce 7's anisotropic filtering performance. Look at the difference between 4, 5 and 6 quads - that's about 2 percent per quad overall. You might say it's closer to 1 percent with 6 quads, but that's very likely caused by limited memory bandwidth. That's something we can work around - just clock the memory even higher.
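For reference, the penalty columns in the table above can be reproduced from the average frame rates in the first Quake 4 table. This is a sketch with our own names. The last column is not a direct ratio in our reading of the numbers: it matches the gap between the two AA penalties, which is how the code below treats it.

```python
# Average fps per quad count, taken from the first Quake 4 table.
quake4_avg = {
    6: {"bilinear": 122.17, "af16": 121.51, "aa4": 105.60, "aa4_af16": 97.29},
    5: {"bilinear": 123.25, "af16": 121.33, "aa4": 104.63, "aa4_af16": 94.20},
    4: {"bilinear": 122.01, "af16": 120.57, "aa4": 103.01, "aa4_af16": 90.08},
}


def penalty_percent(baseline_fps, fps):
    """Frame rate lost relative to the baseline, in percent."""
    return (baseline_fps - fps) / baseline_fps * 100.0


def penalty_row(r):
    """Return the four 'Difference (Avg)' columns for one quad count."""
    aa_bilinear = penalty_percent(r["bilinear"], r["aa4"])
    aa_with_af = penalty_percent(r["af16"], r["aa4_af16"])
    af_bilinear = penalty_percent(r["bilinear"], r["af16"])
    # Assumption: the table's last column is the gap between the AA penalties.
    af_with_aa = aa_with_af - aa_bilinear
    return aa_bilinear, aa_with_af, af_bilinear, af_with_aa


for quads, results in quake4_avg.items():
    row = ", ".join(f"{value:.2f}%" for value in penalty_row(results))
    print(f"{quads} quads: {row}")
```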
The question is: how much higher? Unfortunately, this particular card is only stable up to around 675 MHz, or 1350 MHz effective memory clock. At 1400 MHz, not only are scores lower, but artifacts begin to appear. So, we decided to tackle the matter in a different manner - by clocking the card down. We varied the core clock from 250 MHz to 500 MHz in 50 MHz increments and the memory clock from 350 MHz to 650 MHz in 50 MHz increments (or 700 MHz to 1300 MHz effective in 100 MHz increments). For the core measurements, we set the memory clock to 1200 MHz effective - the default reference clock - while for the memory measurements, the core was set to 500 MHz.
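The two clock sweeps described above can be written down as a small test matrix. This is just a sketch of the plan with hypothetical names; actually applying the clocks was done by hand in RivaTuner and is not shown here.

```python
def clock_range(start_mhz, stop_mhz, step_mhz=50):
    """Inclusive list of clocks from start to stop in fixed steps."""
    return list(range(start_mhz, stop_mhz + 1, step_mhz))


def effective_mhz(memory_mhz):
    """GDDR3 is double data rate, so the effective clock is twice the real one."""
    return memory_mhz * 2


# Core sweep: memory held at 600 MHz (1200 MHz effective).
core_sweep = [(core, 600) for core in clock_range(250, 500)]

# Memory sweep: core held at 500 MHz.
memory_sweep = [(500, memory) for memory in clock_range(350, 650)]

for core, memory in core_sweep + memory_sweep:
    print(f"core {core} MHz / memory {memory} MHz ({effective_mhz(memory)} MHz effective)")
```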
Along the way, we decided to test with 4 and 5 quads as well. This way, we'll get some idea of how high a memory clock we need for parity between pixel processing and memory bandwidth with various numbers of quads. All of these numbers are in frames per second.
[Charts: Quake 4 - Data Processing Plant, 1024 x 768, 4x AA 16x AF, frame rate versus core and memory clock from 250 to 650 MHz, with 6, 5 and 4 PS quads]
While there are variations between results, we can see that for the most part the real memory clock needs to be at least 100 MHz above the core clock to maintain an optimal balance of pixel processing and memory bandwidth. That's true whether the card has 4, 5 or 6 quads. With fewer quads, the gap must be even larger to keep up: with 5 quads we have to run at 500 / 1300 MHz to match 6 quads running at 500 / 1200 MHz. That's a 150 MHz difference. For 4 quads, that will be 500 / 1350 MHz or 500 / 1400 MHz (up to a 200 MHz difference).
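The 'difference' figures quoted here compare the core clock with the real memory clock, which is half the effective clock. A minimal sketch of that arithmetic, using the clock pairs from the paragraph above (clock_gap is our own helper name):

```python
def clock_gap(core_mhz, memory_effective_mhz):
    """Gap between the real memory clock and the core clock, in MHz."""
    return memory_effective_mhz / 2 - core_mhz


# (quads, core MHz, effective memory MHz) pairs quoted in the text.
parity_points = [
    (6, 500, 1200),
    (5, 500, 1300),
    (4, 500, 1350),
    (4, 500, 1400),
]

for quads, core, memory_effective in parity_points:
    gap = clock_gap(core, memory_effective)
    print(f"{quads} quads at {core} / {memory_effective} MHz: gap of {gap:.0f} MHz")
```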
This fact can be seen in another way: even higher memory clocks can provide us with even higher frame rates. It makes sense - we saw earlier that the GeForce 7 still has some elbow room with AF, which needs lots of bandwidth due to the large number of texture samples. However, there's a limit to that - we only gain around 4 to 5 fps at most with each 100 MHz increase in effective memory clock, and that gain will most likely be smaller at higher clocks. If we have to speculate, it's very likely that if we were to use the standard core clock of the ASUS EN7800GTX TOP, we would get maximum performance with a 1550 MHz memory clock (if possible). Another example supporting this assumption is the GeForce 7800GTX 512's core and memory clocks - 550 / 850 MHz (or 1700 MHz effective) - a 300 MHz difference.
Of course, this is quite interesting. We saw that having more quads does provide slightly higher frame rates. We also saw that with more than 5 quads, the increase per quad is smaller, and we blame this on a lack of bandwidth. By adding more bandwidth we can get higher frame rates, and most likely we have to go above the 100 MHz difference for the 6 quad GeForce 7800GTX. Time to see if our speculation is correct. Let's see just how much of a performance increase we can squeeze out of core and memory clock changes. All the numbers below are in percent, not frame rates.
[Charts: Quake 4 - Data Processing Plant, 1024 x 768, 4x AA 16x AF, percent change in frame rate for each 50 MHz clock step (250 - 300 through 600 - 650 MHz), with 6, 5 and 4 PS quads]
We can see that clock changes can offer higher gains than having more quads, ranging from 3 to 12 percent. However, there's a caveat to using higher clocks. Now, this is quite hard to visualize, but remember that we need to have at least a 100 MHz difference between core and memory clocks. We can see that's true even for clock increases, but there's more to it than that.
With 6 quads, increasing the core clock from 350 to 400 MHz gives about the same increase as raising the memory clock from 450 to 500 MHz (or 900 to 1000 MHz effective). However, at higher clocks, we have to start from a higher base memory clock to get as much of an increase as with the core. Increasing the core clock from 400 to 450 MHz offers the same increase as raising the memory clock from 600 to 650 MHz (or 1200 to 1300 MHz effective). So, in that respect, the GeForce 7800GTX 512 MB can be thought of as the upper limit of what the GeForce 7 can do. There's no point in using even higher clocks except in some extreme situations, since doing so will most likely bring very little benefit.
If we look at the results with 4 and 5 quads, the situation is similar. With 4 quads, we gain pretty much the same by increasing the core from 350 to 400 MHz as we do by increasing the memory from 450 to 500 MHz (or 900 to 1000 MHz effective). As we hit 500 MHz on the core, we have to increase the memory clock from at least 600 to 650 MHz (or 1200 to 1300 MHz effective). So, having fewer quads actually means less strain on bandwidth, but it's not really a plus point since we already need higher memory clocks to compensate in the first place.
What have we learned?
On a mostly texture dependent game like Quake 4, we can see that having more quads does help performance on the GeForce 7 series (somewhat). However, there are limits to this, for two reasons. One, having more quads means more bandwidth is required to keep them fully used. The solution may seem simple enough - add more bandwidth, in this case by increasing the memory clock. We can do that, but then we run into the second problem: at higher core clocks, an even larger increase in memory clock is needed - anywhere from 100 MHz to 300 MHz (or more, if you can get that high) at very high clocks. A more efficient approach than adding more quads or ROPs (since we're talking about AA and AF performance) is to clock the card higher, but eventually you'll run into that second problem again. Using higher and higher clocks is not always possible, as in our case.
Even if you manage to clock the card high enough to get the most out of all 6 quads, most of the performance increase will go towards 'improving' anisotropic filtering performance and not AA. So, running with AA and AF will always be slightly slower. Even at very high clocks, though, AF's penalty is very small, maybe around 3 - 5 percent overall. With fewer quads, the penalty (at the same clock) will be slightly higher.
There's no point in having more quads or, more precisely, more ROP units if you don't have the bandwidth to feed them all, particularly in texture dependent games. You can compensate for the lack of bandwidth with higher memory clocks, but there's a limit to that. Compression and caching may help some, but not in the raw bandwidth department - these features only allow the card to make optimal use of the bandwidth that's available. Without a new, wider memory controller, the next generation of cards will likely not offer much improvement over the current generation in texture dependent games.
F.E.A.R
Thankfully, games are now beginning the transition to shaders. One such game is F.E.A.R, which is still taxing even on the GeForce 7800GTX 256 MB with AA and AF enabled. So how do the number of quads and the core and memory clocks influence performance in this game? Let's find out.
[Charts: F.E.A.R - Performance Test, 1024 x 768, with 6, 5 and 4 PS quads. Settings shown: Bilinear, Trilinear, 2x AF, 4x AF, 8x AF, 16x AF, 4x AA Bilinear, 4x AA 16x AF]
Looks like we still have some elbow room in F.E.A.R's performance test. Using different types of filtering or changing the number of samples has quite an effect on performance, so we're pretty sure we're not system limited here. F.E.A.R is a shader heavy game, so it's interesting to see the performance difference between 4, 5 and 6 quads at the same core and memory clocks. All the results show the same tendency - lower frame rates as more complicated filtering and / or more samples are used.
An interesting side note: it looks like 2x AF and bilinear do not perform the same, and neither do 4x AF and trilinear, even though each pair uses the same number of texture samples.
With 6 quads, the anisotropic filtering penalty ranges from 3 to 7 fps, while with 5 quads it's around 10 fps. Strangely enough, 4 quads suffered less of a penalty than 5 quads, ranging from 6 to 8 fps. F.E.A.R benchmark results do fluctuate a bit, usually by 2 - 3 fps, so we think that the actual penalty is the same regardless of the number of quads. Overall, the difference is around 5 fps from 5 to 6 quads and around 10 fps from 4 to 5 quads. Look at the bilinear, 8x and 16x AF results. Once AA is put into the mix, the picture changes slightly - frame rates with 5 and 6 quads are identical, while 4 quads lag slightly behind. How is that possible? Let's put those numbers in a table to see them more clearly.
| 6 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 47 | 45 | 44 | 43 | 44 | 45 | 40 | 38 |
| Avg | 119.65 | 109.32 | 112.45 | 106.85 | 103.14 | 103.9 | 72.27 | 61.84 |
| Max | 380 | 238 | 381 | 399 | 306 | 318 | 249 | 137 |
| 5 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 48 | 46 | 44 | 46 | 46 | 44 | 38 | 36 |
| Avg | 112.13 | 107.92 | 100.82 | 97.75 | 99.64 | 98.07 | 69.22 | 61.94 |
| Max | 372 | 350 | 215 | 203 | 354 | 272 | 199 | 217 |
| 4 Quad |
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| Min | 46 | 43 | 46 | 44 | 43 | 42 | 34 | 31 |
| Avg | 103.03 | 99.23 | 97.25 | 91.29 | 89.6 | 89.23 | 65.7 | 58.66 |
| Max | 374 | 386 | 356 | 335 | 27 | 299 | 249 | 223 |
Setting the maximum fps figures aside, we can finally see just how many frames are lost to AA - around 40 fps, regardless of the number of quads. With 6 quads, we still have some elbow room with pure AA, but that's gone as soon as we enable 16x AF. F.E.A.R does behave differently compared to Quake 4. It looks like we gain frame rates with each quad in both AA and AF, not just AF. Let's see the differences per quad.
| | Bilinear | Trilinear | 2xAF | 4xAF | 8xAF | 16xAF | 4xAA Bilinear | 4xAA 16xAF |
| 5 to 6 Quad |
| Min | -2.08% | -2.17% | 0.00% | -6.52% | -4.35% | 2.27% | 5.26% | 5.56% |
| Avg | 6.70% | 1.29% | 11.53% | 9.31% | 3.51% | 5.95% | 4.40% | -0.15% |
| Max | 2.15% | -32.00% | 77.21% | 96.55% | -13.56% | 16.91% | 25.13% | -36.87% |
| 4 to 5 Quad |
| Min | 4.35% | 6.98% | -4.35% | 4.55% | 6.98% | 4.76% | 11.76% | 16.13% |
| Avg | 8.84% | 8.76% | 3.68% | 7.08% | 11.21% | 9.91% | 5.36% | 5.60% |
| Max | -0.53% | -9.33% | -39.61% | -39.40% | 1211.11% | -9.03% | -20.08% | -2.69% |
| Per Quad (Average) |
| Min | 1.13% | 2.40% | -2.17% | -0.99% | 1.31% | 3.52% | 8.51% | 10.84% |
| Avg | 7.77% | 5.03% | 7.60% | 8.19% | 7.36% | 7.93% | 4.88% | 2.72% |
| Max | 0.81% | -20.66% | 18.80% | 28.57% | 598.78% | 3.94% | 2.52% | -19.78% |
Well, by looking at this table we can say that, just like in Quake 4, the GeForce 7 gains the most in F.E.A.R by moving from 4 to 5 quads. However, we can also see that, unlike Quake 4, F.E.A.R gains more frames with pure AA than with AA and AF combined. In fact, we gain almost nothing with AA and AF by moving from 5 to 6 quads. The various filtering settings are quite interesting. We mentioned earlier that 2x AF and 4x AF behave differently from bilinear and trilinear. Here we can see that we actually gain more from the sixth quad than from the fifth under those settings. But in the more commonly used modes such as bilinear, trilinear, 8x AF and 16x AF, we gain less with 6 quads. Let's move on to the next table.
| 6 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | 14.89% | 15.56% | 4.26% | 0.66% |
| Difference (Avg) | 39.60% | 40.48% | 13.16% | 0.88% |
| Difference (Max) | 34.47% | 56.92% | 16.32% | 22.44% |
| 5 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | 20.83% | 18.18% | 8.33% | -2.65% |
| Difference (Avg) | 38.27% | 36.84% | 12.54% | -1.43% |
| Difference (Max) | 46.51% | 20.22% | 26.88% | -26.28% |
| 4 Quad |
| | AA (with Bilinear) | AA (with 16x AF) | 16x AF (with Bilinear) | 16x AF (with AA) |
| Difference (Min) | 26.09% | 26.19% | 8.70% | 0.10% |
| Difference (Avg) | 36.23% | 34.26% | 13.39% | -1.96% |
| Difference (Max) | 33.42% | 25.42% | 20.05% | -8.00% |
Well, this is more like it. We can see that the performance penalty with AA is pretty high and pretty much the same regardless of the number of quads. Where there are differences, they amount to around 5 percent and are not really noticeable in gameplay. They could also be caused by normal variations between runs. This is true for both pure AA and AA with AF. We can also see that the AF penalty is pretty much the same regardless of the number of quads.
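To put a number on 'pretty much the same', the spread of the pure AA penalty across quad counts can be checked from the average F.E.A.R frame rates in the earlier table. A rough sketch, with names of our own choosing:

```python
# Average F.E.A.R fps per quad count: bilinear without AA, and 4x AA bilinear.
fear_avg = {
    6: {"bilinear": 119.65, "aa4": 72.27},
    5: {"bilinear": 112.13, "aa4": 69.22},
    4: {"bilinear": 103.03, "aa4": 65.70},
}


def aa_penalty_percent(results):
    """Share of the no-AA frame rate that is lost when 4x AA is enabled."""
    return (results["bilinear"] - results["aa4"]) / results["bilinear"] * 100.0


aa_penalties = {quads: aa_penalty_percent(r) for quads, r in fear_avg.items()}
spread = max(aa_penalties.values()) - min(aa_penalties.values())

for quads, penalty in aa_penalties.items():
    print(f"{quads} quads: {penalty:.2f}% lost to 4x AA")
print(f"Spread between configurations: {spread:.2f} percentage points")
```

The spread stays under the 5 percent mark mentioned above, so on these averages the AA penalty really is close to constant across 4, 5 and 6 quads.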
Remember that F.E.A.R is a shader heavy game, so the deciding factor here may not be the ROP units inside each quad, but the pixel shader units instead. All this seems to indicate that we're core limited (or at least pixel shader limited) and not bandwidth limited in F.E.A.R. It's very likely that increasing both core and memory clocks will provide more tangible gains than in Quake 4. That's the only thing we can do, since we can't add more pixel shader units. Let's take a look. Below you'll see the graphs with core and memory clock changes, both at default settings (no AA, trilinear) and with 4x AA and 16x AF. All the numbers below are in frame rates.
[Charts: F.E.A.R - Performance Test, 1024 x 768, 4x AA 16x AF, frame rate versus core and memory clock from 250 to 650 MHz, with 6, 5 and 4 PS quads]
If you look at the results with no AA and AF, it's clear we are core limited. Having more bandwidth won't give you higher frame rates, so the only thing we can do in this situation is use higher core clocks. This is true regardless of the number of quads. In this situation, each additional quad will only give you an improvement of less than 10 percent (around 5 - 10 fps).
With AA and AF, it's a slightly different situation. Only with 6 quads did we see any significant improvement with higher memory clocks. Why is that? We know that the AA and AF penalties are pretty much the same with 4, 5 and 6 quads. Well, the answer is actually quite simple - fill rate or, more precisely, pixel shader fill rate. Remember that 6 quads mean a total of 48 shader units on the GeForce 7 (8 per quad), while 4 and 5 quads mean 32 and 40 shader units, respectively. Those extra shader units seem to appreciate the extra bandwidth, if present. And AA needs lots of fill rate. In texture dependent games, ROP units would be more influential here, but F.E.A.R is a shader heavy game. That's why it really benefits from the presence of extra pixel shader units.
Now let's see just how core and memory clock changes affect this game. This time, we'll separate the results with and without AA and AF. These numbers are in percent, not frame rates.
[Charts: F.E.A.R - Performance Test, 1024 x 768, no AA and AF, percent change in frame rate for each 50 MHz clock step (250 - 300 through 600 - 650 MHz), with 6, 5 and 4 PS quads]
This is further proof that we're pixel shader limited in F.E.A.R. With the move from 450 to 500 MHz, it's actually the 4 quad configuration that gains the most. However, that doesn't mean having 6 quads is useless. We do get a larger increase in frame rates from higher memory clocks with 6 quads than with 4 or 5. On a side note, the graphs here are a little bit strange - just look at the move from 350 to 400 MHz memory clock. We don't know what's causing this, but we think it's either latency or buffer related.
Since we're shader limited, core clock rules here. That's why we will most likely not see any further increase with memory clocks above 650 MHz (1300 MHz effective), at least not without AA and AF. We need to push the core even higher, probably to at least 550 MHz, just like the GeForce 7800GTX 512, to maintain parity between pixel processing and memory bandwidth.
Now, let's see what happens with AA and AF.
[Charts: F.E.A.R - Performance Test, 1024 x 768, 4x AA 16x AF, percent change in frame rate for each 50 MHz clock step (250 - 300 through 600 - 650 MHz), with 6, 5 and 4 PS quads]
Now, that's just odd, but it's actually quite normal in a way. Remember, we're dealing with shaders and not textures, so it looks like our guess about buffers or latency was right. With textures, the gains should decline more gradually, just like in the Quake 4 graphs. Look at the 4 and 5 quad graphs. We can see that at higher memory clocks the increase is more regular and relatively constant, probably because the memory is in sync with the demands of the pixel shader units. With 6 quads and their extra pixel shader units, the bandwidth demand is larger, which is why we're seeing a difference. Only at even higher clocks do the memory controller, buffers and pixel shaders fall back in sync.
So, again what have we learned?
In shader heavy games like F.E.A.R, more quads can be good, but not for the obvious reasons. It's the pixel shader units inside each quad that make the difference. Depending on what kind of shaders are used, you'll likely gain more from core clock increases, which compensate for a lack of pixel shader units, than from memory clock increases. You can push core clocks as high as they will go to get higher frame rates when playing without AA and AF. With 6 quads, a slightly higher memory clock does provide some benefit, but it won't be much, or even noticeable in gameplay.
With AA and AF, things get complicated real quick. We saw that more quads or, more accurately, more ROPs won't help with AA and AF, so we'll have to rely on brute force to attain higher frame rates here. The shaders in F.E.A.R are more complicated, which means they can take longer to run than the simple shaders in texture dependent games. So, memory bandwidth here means not only raw bandwidth, but also bandwidth that's in sync with the pixel shader units' requests.
Unfortunately, that makes it very difficult to estimate how high a core or memory clock you need to get the most gain.
Instead of having more quads, F.E.A.R will likely benefit more from extra shader units. A plus is that they are far less sensitive to bandwidth, so you could actually go with lower memory clocks than with fewer shader units. You can compensate for a lack of shader units with higher core clocks. Using higher memory clocks may bring some benefit, but you'll have to be careful about the sync problem, where you'll gain no frames at all until you hit a higher clock.
Conclusion:
The GeForce 7 architecture is very sensitive to bandwidth. Without additional bandwidth, there's actually no point in adding more quads. It would be more appropriate to simply clock the core high enough to keep parity with memory bandwidth and vice versa. However, even in texture dependent games like Quake 4, we saw that this won't reduce the AA penalty at all (although you'll likely have higher nominal frame rates). In our case, that penalty is at least around 20 fps or 20 percent (most likely more, since we're system limited).
An upside to this is that you can compensate for a lack of quads with high enough memory bandwidth or, more precisely, memory clock. Of course, finding a GeForce 7800GT or 7800GS with memory that can scale that high is difficult, but not impossible. So that's good news for overclockers on a budget, or even gamers who don't want to pay an arm and a leg for a GeForce 7800GTX or the elusive GeForce 7800GTX 512.
On shader heavy games, getting the core and memory clock combination right is tricky. It can be done, but it will be a process of trial and error. Of course, if you only care about the nominal frame rate, you can just clock the core and memory as high as you can.
Some Forward Looking Statements
Ah, speculation - the food of tech aficionados and hardware review forum posters everywhere. Well, we chose not to speculate here but rather to point out the areas that need to be addressed by the next generation of graphics cards. First and foremost is bandwidth. To get even higher frame rates in texture dependent games, we are in desperate need of significantly more bandwidth. There are at least three alternatives here:
- Higher clocks
- Wider bus
- New, faster memory type
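To give a feel for what the first two options buy, peak memory bandwidth is the effective memory clock times the bus width in bytes. The sketch below assumes the 256-bit memory bus of the GeForce 7800GTX (a spec we're adding here, it is not stated in the text above) and compares it against a purely hypothetical 512-bit bus. A faster memory type would show up in the same formula as a higher effective clock.

```python
def bandwidth_gb_s(effective_mhz, bus_width_bits):
    """Peak memory bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_mhz * 1e6 * bytes_per_transfer / 1e9


# 256 bits is the assumed bus width of the card; 512 bits is hypothetical.
options = {
    "reference clock (1200 MHz, 256-bit)": bandwidth_gb_s(1200, 256),
    "higher clock (1700 MHz, 256-bit)": bandwidth_gb_s(1700, 256),
    "wider bus (1200 MHz, 512-bit)": bandwidth_gb_s(1200, 512),
}

for label, value in options.items():
    print(f"{label}: {value:.1f} GB/s")
```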
Depending on how much bandwidth is available, more quads (in terms of ROPs) can be used. But we think that in this case, NVIDIA should address AA performance instead of adding more G70-like quads. Even at their best, the GeForce 7 series quads still lose a lot of frame rate when AA is enabled (either pure or with AF). That implies a change in architecture. Just how much, we can't tell. With more efficient AA (or even a better implementation), both texture dependent and shader heavy games should be faster with AA and AF.
Of course, we could also, relatively speaking, 'give up' on texture dependent games altogether and choose to add more pixel shader units. After all, they are far less sensitive to bandwidth and won't require much of it. At its worst, such a design should be at least as fast as the current generation of graphics cards. Higher clocks and larger buffers should allow higher quality filtering and / or antialiasing to be used with less of a penalty. That's pretty much the road ATI took with the X1900 series.