Tech-Hounds.com

Because gamers play games, not benchmarks




Pushing the Frame Rate Envelope

Last month, ATI finally launched its Radeon X1900 series, formerly known as R580. From a feature standpoint, it's not that much different from the very late-to-market Radeon X1800 series. However, there is one distinctive difference between the two - the number of pixel shader units. Whereas the Radeon X1800 has 16, the Radeon X1900 comes equipped with 48. The reason: to push shaders as fast as possible.

ATI's decision to go all out on shaders was not made without reason or deep thought. Newer games rely more and more on shaders instead of textures. But that's not the only reason. Textures not only take up space but are also very heavy on bandwidth. The burden can be too much for the current generation of graphics cards, even high end monsters like the GeForce 7800GTX or Radeon X1900 XT. GDDR3 memory clocks are already near their upper limits, and the same applies to the memory controllers used by these cards. Next year's games, or even this year's, will soon arrive with more detail, be it in textures, shaders or a mixture of both.

Beyond3D recently interviewed Eric Demers about ATI's design decisions for the Radeon X1900 (which you can read here). It raises some interesting questions and facts, which could be useful for predicting where we will be going in the next year or so. Of particular interest to us is, of course, the decision to go with more pixel shader units while keeping the number of ROP units the same. NVIDIA, on the other hand, seems poised to extend the G70 (GeForce 7 series) architecture further by adding more G70-like quads.

So, we decided to play around with a GeForce 7800GTX some more. This time, we took it for a spin to see the influence of quads, core clock and memory clock on current games. This article started out as something to do on the weekend, but it quickly became more and more interesting (at least to us) as testing progressed.

Getting Ready

From the current crop of games, there are two which we feel represent the texture dependent vs shader heavy split perfectly: Quake 4 and F.E.A.R. Despite having lower hardware requirements, F.E.A.R is definitely the heavier of the two. Quake 4, on the other hand, is still (loosely) based on the Doom 3 engine, which in turn still uses shaders from the pre SM 3.0 era. Both games look amazing when viewed in full detail, and gamers and reviewers alike use both extensively when making graphics card purchase decisions.

There's also something else both games share. Even with a GeForce 7800GTX or a Radeon X1800XT, you won't get comfortably playable frame rates with AA and AF above 1024 x 768. Now that's heavy. The only way to get faster frame rates short of overclocking the card is to use the multi-card rendering solutions from ATI and NVIDIA, Crossfire and SLI respectively. But as we pointed out in our last article, this is not always the best solution.

Now that we have the games, we must choose how to enable / disable quads on our GeForce 7800GTX sample and change its core and memory clocks. Since we don't want to do any hard modding (it is still a sample after all), we decided to rely on RivaTuner by Alexey Nikolaychuk, a.k.a. Unwinder. Some might ask why we don't just use the GeForce 7800GT and the newly released GeForce 7800GS. Well, this is not meant to be a review of those cards. We wanted to see what kind of influence one or two quads have on performance, at both different and identical clocks.

Performance

The test setup is pretty much the same as in our last articles. AA and AF settings are applied in game, with image settings set to High Quality in the driver panel and no optimizations enabled. Quake 4 was set to High Quality, while bilinear filtering was set with the command 'image_filter GL_LINEAR_MIPMAP_NEAREST'. 4x AA was set with the command 'r_multisamples 4' and AF with 'image_anisotropy n', with n being the number of samples. F.E.A.R was set to our usual settings: all details at maximum, with Soft Shadows off. System related settings (physics etc) were set to minimum.

We'd like to thank both Tagan and Kingston for supplying the additional power supply and the 1 GB memory modules for this article.

Our test setup
AMD Athlon 64 3500+ socket 939
2 x 1024 MB Kingston KVR 3-3-3 PC3200 DDR-SDRAM
MSI K8N NForce 4 SLI motherboard
ASUS GeForce 7800GTX DDR3 256 MB graphics card
(running at standard clocks - core 486 MHz / memory 675 MHz (1350 MHz effective))
Maxtor DiamondMaxPlus9 80 GB Serial ATA 8 MB buffer
ASUS E-616 DVD-ROM
Tagan TG530-U15 530 watts ATX/BTX power supply

Windows XP Professional with Service Pack 2 installed
NVIDIA Forceware 81.98 reference driver
NVIDIA NForce 4 6.70 reference driver
Creative SoundBlaster Live! 24 bit 5.12.1.512 driver.
DirectX 9.0c

The results:

Quake 4

Quake 4 - Data Processing Plant, 1024 x 768 (6, 5 and 4 PS Quads): charts of minimum, average and maximum fps for Bilinear, Trilinear, 2x AF, 4x AF, 8x AF, 16x AF, 4x AA Bilinear and 4x AA 16x AF.

One glance at the results and you can see they are mostly system limited. Only with AA and AF enabled did we see a significant difference in frame rates. Because of this, we will only be looking at AA and AF results from Quake 4. Now, let's look closely at the results with 4, 5 and 6 PS quads. You can use the table below to see the numbers more clearly.


6 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           69         69      68      68      68      69             70          64
Avg       122.17     122.09  122.74  121.75  121.56  121.51          105.6       97.29
Max          165        159     168     165     162     165            167         153

5 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           69         69      70      69      69      68             71          62
Avg       123.25     123.02  123.17  122.58  122.06  121.33         104.63        94.2
Max          168        170     162     168     163     164            170         144

4 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           67         68      69      69      69      69             68          59
Avg       122.01     121.99  122.01  122.04  121.11  120.57         103.01       90.08
Max          170        167     161     170     171     162            169         138

Of course, on a more powerful system the results will be slightly different. Focus on the AA and AF results, where we're not as system limited as with just bilinear or even 16x anisotropic filtering. 'Pure' AA performance is much the same with 4, 5 or 6 quads; the difference is around 2 - 3 fps overall. So, having more quads won't give you higher frame rates in Quake 4 (or games that depend mostly on textures). Well, at least not in pure AA with bilinear. However, it seems that having more quads does help performance with AA and AF enabled at the same time. The difference is not noticeable in gameplay, but it's interesting nonetheless. Let's see just how much performance we gain with each quad.


5 to 6 Quad         Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min                    0.00%      0.00%  -2.86%  -1.45%  -1.45%   1.47%         -1.41%       3.23%
Avg                   -0.88%     -0.76%  -0.35%  -0.68%  -0.41%   0.15%          0.92%       3.28%
Max                   -1.79%     -6.47%   3.70%  -1.79%  -0.61%   0.61%         -1.76%       6.25%

4 to 5 Quad         Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min                    2.99%      1.47%   1.45%   0.00%   0.00%  -1.45%          4.41%       5.08%
Avg                    1.02%      0.84%   0.95%   0.45%   0.79%   0.63%          1.57%       4.57%
Max                   -1.18%      1.80%   0.62%  -1.18%  -4.68%   1.23%          0.59%       4.35%

Per Quad (Average)  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min                    1.49%      0.74%  -0.70%  -0.72%  -0.72%   0.01%          1.50%       4.16%
Avg                    0.07%      0.04%   0.30%  -0.12%   0.19%   0.39%          1.25%       3.92%
Max                   -1.48%     -2.34%   2.16%  -1.48%  -2.65%   0.92%         -0.59%       5.30%

Here we can see that the move from 4 to 5 quads provides the highest increase in performance, both with AA and AF enabled and with 'pure' AA, although a gain of around 1.5 % (with pure AA) is far from significant. While 6 quads is faster, the increase is actually smaller - probably due to limitations in memory bandwidth (which we'll look into later). What's interesting is that the AA and AF gains can be around three times the 'pure' AA gains. This supports our previous suspicion that having more quads has little to do with the GeForce 7's (and probably the GeForce 6's as well) performance with AA alone, but it does help with AA and AF combined. You can say that with each quad, performance under AA and AF gets closer to 'pure' AA levels - giving us almost 'free' anisotropic filtering.
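
For readers who want to check the per quad percentages themselves, here is a minimal sketch in Python. It only uses the average fps from the Quake 4 tables; the function and variable names are our own illustration, and because the inputs are rounded to two decimals the last digit can differ slightly from the table.

```python
# Average fps from the Quake 4 tables (1024 x 768), keyed by number of PS quads.
AVG_FPS = {
    4: {"4xAA Bilinear": 103.01, "4xAA 16xAF": 90.08},
    5: {"4xAA Bilinear": 104.63, "4xAA 16xAF": 94.20},
    6: {"4xAA Bilinear": 105.60, "4xAA 16xAF": 97.29},
}


def percent_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def gains_per_quad(mode: str) -> dict:
    """Gain for each added quad (4 to 5, 5 to 6) in the given filtering mode."""
    quads = sorted(AVG_FPS)
    return {
        (lo, hi): percent_gain(AVG_FPS[lo][mode], AVG_FPS[hi][mode])
        for lo, hi in zip(quads, quads[1:])
    }


if __name__ == "__main__":
    for mode in ("4xAA Bilinear", "4xAA 16xAF"):
        for (lo, hi), gain in gains_per_quad(mode).items():
            print(f"{mode}: {lo} to {hi} quads: {gain:+.2f}%")
```

Comparing the two modes side by side makes it easy to see that the AA and AF gains are several times larger than the 'pure' AA gains.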

OK, let's see the numbers on the first table in a different light.

6 Quad            AA (with Bilinear)  AA (with 16x AF)  16x AF (with Bilinear)  16x AF (with AA)
Difference (Min)              -1.45%             7.25%                   0.00%             8.70%
Difference (Avg)              13.56%            19.93%                   0.54%             6.37%
Difference (Max)              -1.21%             7.27%                   0.00%             8.48%

5 Quad            AA (with Bilinear)  AA (with 16x AF)  16x AF (with Bilinear)  16x AF (with AA)
Difference (Min)              -2.90%             8.82%                   1.45%            11.72%
Difference (Avg)              15.11%            22.36%                   1.56%             7.25%
Difference (Max)              -1.19%            12.20%                   2.38%            13.39%

4 Quad            AA (with Bilinear)  AA (with 16x AF)  16x AF (with Bilinear)  16x AF (with AA)
Difference (Min)              -1.49%            14.49%                  -2.99%            15.99%
Difference (Avg)              15.57%            25.28%                   1.18%             9.72%
Difference (Max)               0.59%            14.81%                   4.71%            14.23%

We think this table sums it up pretty well. Running at the same core and memory clocks, having more quads will (only) help the GeForce 7's anisotropic filtering performance. Look at the 16x AF (with AA) difference between 4, 5 and 6 quads - that's about 2 percent per quad overall. You might say it's closer to 1 percent going from 5 to 6 quads, but that's very likely caused by limited memory bandwidth. That's something we can work around - just clock the memory even higher.
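
The 'Difference (Avg)' rows can be reproduced from the average fps with a few lines of Python. This is a sketch under one assumption of ours: the '16x AF (with AA)' column appears to be the AA penalty with 16x AF minus the AA penalty with bilinear, which is what the published numbers work out to. The names below are illustrative only.

```python
# Average fps from the Quake 4 tables, keyed by number of PS quads.
AVG_FPS = {
    6: {"bilinear": 122.17, "af16": 121.51, "aa4": 105.60, "aa4_af16": 97.29},
    5: {"bilinear": 123.25, "af16": 121.33, "aa4": 104.63, "aa4_af16": 94.20},
    4: {"bilinear": 122.01, "af16": 120.57, "aa4": 103.01, "aa4_af16": 90.08},
}


def penalty(base: float, result: float) -> float:
    """Frame rate lost going from `base` to `result`, in percent of `base`."""
    return (base - result) / base * 100.0


def af_cost_with_aa(fps: dict) -> float:
    """Extra penalty of 16x AF once 4x AA is already enabled."""
    aa_with_bilinear = penalty(fps["bilinear"], fps["aa4"])
    aa_with_af16 = penalty(fps["af16"], fps["aa4_af16"])
    # Assumption: the table's "16x AF (with AA)" column is this difference.
    return aa_with_af16 - aa_with_bilinear


if __name__ == "__main__":
    for quads in sorted(AVG_FPS, reverse=True):
        print(f"{quads} quads: 16x AF (with AA) = {af_cost_with_aa(AVG_FPS[quads]):.2f}%")
```

Read from 4 to 6 quads, the value drops with every quad added, which is the 'about 2 percent per quad' trend described above.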

The question is how much? Unfortunately, this particular card is only stable at around 675 MHz (1350 MHz effective) memory clock. At 1400 MHz effective, not only are scores lower, but artifacts begin to appear. So, we decided to tackle this matter in a different manner - by clocking down the card. We clocked the card's core from 250 MHz to 500 MHz in 50 MHz increments and the memory from 350 MHz to 650 MHz in 50 MHz increments (or 700 MHz to 1300 MHz effective in 100 MHz increments). For the core measurements, we set the memory clock to 600 MHz (1200 MHz effective) - the default reference clock - while for the memory measurements, the core was set to 500 MHz.
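
To put those memory clocks into bandwidth terms, here is a rough sketch. The doubling from real to effective clock follows the numbers above; the 256 bit memory bus is our assumption for the GeForce 7800GTX (it is not stated in this article), and the result is theoretical peak bandwidth only.

```python
# Assumption: 256 bit memory interface (not stated in the article).
ASSUMED_BUS_WIDTH_BITS = 256


def effective_mhz(memory_mhz: float) -> float:
    """DDR memory transfers twice per clock, e.g. 675 MHz -> 1350 MHz effective."""
    return memory_mhz * 2


def bandwidth_gb_per_s(memory_mhz: float,
                       bus_width_bits: int = ASSUMED_BUS_WIDTH_BITS) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_mhz(memory_mhz) * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    for clock in (350, 600, 650, 675):
        print(f"{clock} MHz -> {effective_mhz(clock):.0f} MHz effective, "
              f"{bandwidth_gb_per_s(clock):.1f} GB/s peak")
```

Under that assumption, each 50 MHz step in real memory clock adds 3.2 GB/s of peak bandwidth, which is the quantity the memory measurements are really varying.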

Along the way, we decided to test with 4 and 5 quads as well. This way, we'll get some idea of just how high a memory clock we need to reach parity between pixel processing and memory bandwidth with various numbers of quads. All of these numbers are in frames per second.

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (6 PS Quads), fps
Clock (MHz)             650    600    550    500    450    400    350    300    250
Core (memory at 600)      -      -      -  92.16  89.41  85.51   79.4  71.88  64.41
Memory (core at 500)  95.98  91.97  87.31  81.84  75.78  68.79  61.27      -      -

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (5 PS Quads), fps
Clock (MHz)             650    600    550    500    450    400    350    300    250
Core (memory at 600)      -      -      -  89.71   87.1  82.73  76.79  69.31  61.85
Memory (core at 500)  94.38  89.83  85.08  80.18  74.31  68.12  60.69      -      -

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (4 PS Quads), fps
Clock (MHz)             650    600    550    500    450    400    350    300    250
Core (memory at 600)      -      -      -  86.36  83.09  78.53  72.29  64.66  57.89
Memory (core at 500)  90.15  86.54  82.08  77.87  71.74  65.91  58.85      -      -

Clocks are real (not effective) MHz; - means not measured at that clock.

While there are variations between results, we can see that for the most part, we need at least a 100 MHz difference between the core and memory clocks to maintain an optimal balance of pixel processing and memory bandwidth. That's true whether the card has 4, 5 or 6 quads. With fewer quads, the gap must be even larger than 100 MHz to maintain that parity - with 5 quads we have to run at 500 / 1300 MHz to keep up with 6 quads running at 500 / 1200 MHz. That's a 150 MHz difference. For 4 quads, that will be 500 / 1350 MHz or 500 / 1400 MHz (or a 200 MHz difference).
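
As a rough cross-check of those clock pairs, the sketch below draws a straight line through the 600 and 650 MHz memory results (core at 500 MHz) and asks what memory clock 5 or 4 quads would need to match 6 quads at 500 / 600 MHz. The linear fit is purely our assumption; since the gains shrink at higher clocks, the real requirement is probably somewhat higher.

```python
# fps at 4xAA 16xAF with the core at 500 MHz, keyed by real memory clock (MHz).
MEMORY_SWEEP = {
    6: {600: 91.97, 650: 95.98},
    5: {600: 89.83, 650: 94.38},
    4: {600: 86.54, 650: 90.15},
}


def memory_clock_for(target_fps: float, sweep: dict) -> float:
    """Memory clock needed for `target_fps`, assuming fps is linear in clock."""
    (clock_lo, fps_lo), (clock_hi, fps_hi) = sorted(sweep.items())
    slope = (fps_hi - fps_lo) / (clock_hi - clock_lo)  # fps per MHz
    return clock_lo + (target_fps - fps_lo) / slope


if __name__ == "__main__":
    target = MEMORY_SWEEP[6][600]  # 6 quads at 500 / 1200 MHz effective
    for quads in (5, 4):
        clock = memory_clock_for(target, MEMORY_SWEEP[quads])
        print(f"{quads} quads: about {clock:.0f} MHz ({2 * clock:.0f} MHz effective)")
```

With these inputs, 5 quads land between the 600 and 650 MHz steps and 4 quads land just above 650 MHz, which is in line with the 1300 MHz and 1350 to 1400 MHz effective figures quoted above.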

This fact can be seen in another way: even higher memory clocks can provide us with even higher frame rates. It makes sense - we saw earlier that the GeForce 7 still has some elbow room with AF, which needs lots of bandwidth due to the large number of texture samples. However, there's a limit to that - we only gain around 4 to 5 fps at most with each 100 MHz increase in effective memory clock. That gain will most likely be smaller at higher clocks. If we have to speculate, it's very likely that if we were to use the standard core clock of the ASUS EN7800GTX TOP, we would get maximum performance with a 1550 MHz memory clock (if that were possible). Another example that supports this assumption is the GeForce 7800GTX 512's core and memory clocks - 550 / 850 MHz (or 1700 MHz effective) - a 300 MHz difference.

Of course, this is quite interesting. We saw that having more quads does provide slightly higher frame rates. We also saw that with more than 5 quads, the increase per quad is smaller, and we blame this on a lack of bandwidth. By adding more bandwidth we can get higher frame rates, and most likely we have to go beyond the 100 MHz difference for the 6 quad GeForce 7800GTX. Time to see if our speculation is correct. Let's see just how much of a performance increase we can squeeze out of core and memory clock changes. All the numbers below are in percent, not frame rates.

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (6 PS Quads), gain in percent
Step (MHz)            600-650  550-600  500-550  450-500  400-450  350-400  300-350  250-300
Core (memory at 600)        -        -        -     3.08     4.56     7.69    10.46    11.59
Memory (core at 500)     4.35     5.34     6.68     7.99    10.16    12.29        -        -

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (5 PS Quads), gain in percent
Step (MHz)            600-650  550-600  500-550  450-500  400-450  350-400  300-350  250-300
Core (memory at 600)        -        -        -        3     5.29     7.72    10.28    12.06
Memory (core at 500)     5.06     5.59     6.11     7.89     9.09    12.25        -        -

Quake 4 - Data Processing Plant, 1024 x 768 4xAA 16x AF (4 PS Quads), gain in percent
Step (MHz)            600-650  550-600  500-550  450-500  400-450  350-400  300-350  250-300
Core (memory at 600)        -        -        -     3.93     5.81     8.64     11.8     11.7
Memory (core at 500)     4.16     5.43     5.45     8.51     8.84       12        -        -

Clocks are real (not effective) MHz; - means not measured at that step.

We can see that clock changes can offer higher gains than having more quads, ranging from 3 to 12 percent. However, there's a caveat to using higher clocks. Now, this is quite hard to visualize, but remember that we need to have at least a 100 MHz difference between core and memory clocks. We can see that's true even for clock increases, but there's more to it than that.
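
The step gains and the core / memory pairings discussed here can be sketched in Python as follows. We read the 6 quad frame rates as one core series (250 to 500 MHz) and one memory series (350 to 650 MHz); that pairing of values with clocks is our interpretation of the results, and the helper names are illustrative.

```python
# 6 quad fps at 4xAA 16xAF, keyed by real clock in MHz.
CORE_SWEEP = {250: 64.41, 300: 71.88, 350: 79.40, 400: 85.51, 450: 89.41, 500: 92.16}
MEMORY_SWEEP = {350: 61.27, 400: 68.79, 450: 75.78, 500: 81.84,
                550: 87.31, 600: 91.97, 650: 95.98}


def step_gains(sweep: dict) -> dict:
    """Percent gain for each 50 MHz step, keyed by (from_clock, to_clock)."""
    clocks = sorted(sweep)
    return {
        (lo, hi): (sweep[hi] - sweep[lo]) / sweep[lo] * 100.0
        for lo, hi in zip(clocks, clocks[1:])
    }


def closest_step(target_gain: float, gains: dict) -> tuple:
    """The step whose gain is nearest to `target_gain`."""
    return min(gains, key=lambda step: abs(gains[step] - target_gain))


if __name__ == "__main__":
    core, memory = step_gains(CORE_SWEEP), step_gains(MEMORY_SWEEP)
    for step in ((350, 400), (400, 450)):
        match = closest_step(core[step], memory)
        print(f"core {step}: {core[step]:.2f}% ~ memory {match}: {memory[match]:.2f}%")
```

The matching memory step moves up much faster than the core step does, which is the caveat with higher clocks in a nutshell.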

With 6 quads, increasing the core clock from 350 to 400 MHz gives about the same increase as increasing the memory clock from 450 to 500 MHz (or 900 to 1000 MHz effective). However, at higher clocks, we have to start from a higher base memory clock to get as much of an increase as with the core. Increasing the core clock from 400 to 450 MHz offers the same increase as increasing the memory clock from 600 to 650 MHz (or 1200 to 1300 MHz effective). So, in that respect, the GeForce 7800GTX 512 MB can be thought of as the uppermost limit of what the GeForce 7 can do. There's no point in using even higher clocks except in some extreme situations, since doing so will most likely bring very little benefit.

If we look at the results with 4 and 5 quads, the situation is similar. With 4 quads, we gain pretty much the same by increasing the core from 350 to 400 MHz as we do by increasing the memory from 450 to 500 MHz (or 900 to 1000 MHz effective). As the core hits 500 MHz, we have to increase the memory clock from at least 600 to 650 MHz (or 1200 to 1300 MHz effective) to match. So, having fewer quads actually means less strain on bandwidth, but it's not really a plus point since we already need higher memory clocks to compensate in the first place.

What have we learned?

On a mostly texture dependent game like Quake 4, we can see that having more quads does help performance with the GeForce 7 series (somewhat). However, there are limits to this for several reasons. One, having more quads means more bandwidth is required to maintain optimal usage. The solution may seem simple enough - simply add more bandwidth, in this case by increasing the memory clock. We can do that, but then we run into the second problem - at higher core clocks, an even larger increase in memory clock is needed. That can range from 100 MHz to 300 MHz (or more if you can get high enough) at very high clocks. A more efficient approach than adding more quads or ROPs (since we're talking about AA and AF performance) is to clock the card higher, but eventually you'll run into that same second problem. Using higher and higher clocks is not always possible, just like in our case.

Even if you manage to clock the card high enough to get the most out of all 6 quads, most of the performance increase you get will go towards 'improving' anisotropic filtering performance and not AA. So, running with AA and AF will always be slightly slower. Even at very high clocks, though, AF's penalty is very small, maybe around 3 - 5 percent overall. With fewer quads, the penalty (at the same clock) will be slightly higher.

There's no point in having more quads, or more precisely, ROP units, if you don't have the bandwidth to feed them all, particularly for texture dependent games. You can compensate for the lack of bandwidth with higher memory clocks, but there's a limit to that. Compression and caching may help some, but not in the raw bandwidth department - these features only allow the card to make optimal use of the bandwidth that's available. Without a new, wider memory controller, the next generation of cards will likely not offer much improvement over the current generation in texture dependent games.

F.E.A.R

Thankfully, games are now beginning the transition to shaders. One such game is F.E.A.R, which is still taxing even on the GeForce 7800GTX 256 MB with AA and AF enabled. So how do the number of quads and the core and memory clocks influence performance in this game? Let's find out.

F.E.A.R - Performance Test, 1024 x 768 (6, 5 and 4 PS Quads): charts of minimum, average and maximum fps for Bilinear, Trilinear, 2x AF, 4x AF, 8x AF, 16x AF, 4x AA Bilinear and 4x AA 16x AF.

Looks like we still have some elbow room with F.E.A.R's performance test. Using different types of filtering or changing the number of samples has quite an effect on performance. We're pretty sure we're not system limited here. F.E.A.R is a shader heavy game, so it's interesting to see the performance difference between 4, 5 and 6 quads at the same core and memory clocks. All the results show the same tendency - lower frame rates as more complicated filtering and / or more samples are used.

An interesting little side note: it looks like 2x AF and bilinear do not perform the same, and neither do 4x AF and trilinear, even though each pair uses the same number of texture samples.

With 6 quads, the anisotropic filtering penalty can range from 3 to 7 fps, while with 5 quads it's around 10 fps. Strangely enough, 4 quads experienced less of a penalty than 5 quads, ranging from 6 to 8 fps. F.E.A.R benchmark results do fluctuate a bit, usually by 2 - 3 fps, so we think that the actual penalty is the same regardless of the number of quads. Overall, the difference is around 5 fps from 5 to 6 quads and around 10 fps from 4 to 5 quads. Look at the bilinear, 8x and 16x AF results. Once AA is put into the mix, the picture changes slightly - with 4x AA and 16x AF, frame rates with 5 and 6 quads are identical, while 4 quads lags slightly behind. How is that possible? Let's put those numbers in a table to see them more clearly.


6 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           47         45      44      43      44      45             40          38
Avg       119.65     109.32  112.45  106.85  103.14   103.9          72.27       61.84
Max          380        238     381     399     306     318            249         137

5 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           48         46      44      46      46      44             38          36
Avg       112.13     107.92  100.82   97.75   99.64   98.07          69.22       61.94
Max          372        350     215     203     354     272            199         217

4 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min           46         43      46      44      43      42             34          31
Avg       103.03      99.23   97.25   91.29    89.6   89.23           65.7       58.66
Max          374        386     356     335      27     299            249         223

With the maximum fps out of the way, we can finally see just how many frames are lost to AA - around 40 fps, regardless of the number of quads. With 6 quads, we still have some elbow room with pure AA, but that's also gone as soon as we enable 16x AF. F.E.A.R does behave differently compared to Quake 4. It looks like we do gain frame rates with each quad in both AA and AF, not just AF. Let's see the differences per quad.


5 to 6 Quad  Bilinear  Trilinear    2xAF    4xAF    8xAF   16xAF  4xAA Bilinear  4xAA 16xAF
Min            -2.08%     -2.17%   0.00%  -6.52%  -4.35%   2.27%          5.26%       5.56%
Avg             6.70%      1.29%  11.53%   9.31%   3.51%   5.95%          4.40%      -0.15%
Max             2.15%