Benchmarking on Trial - Part 1
Performance evaluations are never easy, and the argument over synthetic versus real-world benchmarks is a never-ending one among reviewers, the general public and even manufacturers. However, it is important to realize that there is a difference between benchmarks and reviews - benchmarks are usually one part of a full review, the part that focuses on performance, since performance is relatively easier to measure than other features of a product (image / sound quality, ease of use and installation etc). Since most people do put performance first when buying hardware, it's only natural for reviews to put more focus on performance than, say, value-added features. Benchmarks are usually done with a fixed setup where only one piece of hardware is changed - the hardware that's being benchmarked. As a reference point, a similar product, whether in performance, capabilities or price, is also benchmarked. Afterwards, the results are compared to see which one offers the higher performance according to the benchmark's metrics.
Once we have a performance comparison, the reviewer will usually add other objective and subjective evaluations, ranging from their experience with the product to notable differences in features and output quality, usually the visual quality of the images for graphics cards. Noise, heat levels and power consumption are also becoming more important than ever. Price may also be an influencing factor - let's face it, if two products being compared offer basically the same performance, we would go with either the one with more features or the one with the lower price, all other things (warranty / support) being equal. A slightly more difficult comparison is when the two products differ in both price and performance. Usually in these cases a price / performance ratio (sometimes with features factored in) is used, but these methods don't take into account the concerns of strictly budget-limited buyers - those who will only spend a certain amount of money on a product and no more.
Back to the subject - benchmarks. Since we want to evaluate a piece of hardware - a specific aspect of a system - it's typical for reviewers to use benchmarks that scale with that particular hardware, be it a processor, memory, graphics card or other component. Synthetic benchmarks are the ideal testing tool in this regard - they completely isolate a particular aspect of a system. However, keep in mind that performance with real applications in real-life scenarios is often bottlenecked by several aspects. For instance, a graphics card can only render as fast as the processor can supply data to it, and the processor can only process data if the data is ready in its memory. An overly simplistic example, but it's not far from the truth. Just look at the many components involved in that statement - graphics card, processor and memory. There's also the processor's front side bus, the memory controller and even the graphics card's interface. That's why reviewers in general (and also the general public) favor benchmark results from real-world applications or games that they actually use / play. Those results have more bearing on the kind of performance they can expect from adding a particular piece of hardware to their system.
But even the use of real-world applications and games as benchmarks has come under fire in recent years. Since benchmarks need to produce repeatable / reproducible results, they often make use of a fixed series of actions or scenarios. While makers of benchmarks typically state that these scenarios are selected to be as close as possible to real-world, day-to-day use, they will not cover all types of use. Benchmarks using real-world applications or games are not immune to this. This is even more true for graphical benchmarks with games - remember that game developers do not write their game code solely for graphics. They must develop their game so the load is at least reasonably balanced between the processor and the graphics card. In this regard, a graphically oriented benchmark - usually a replay or a timedemo that focuses solely on graphics performance - will not offer results that portray the typical frame rate for the entire game. There will be times in the game where you will have graphically intensive scenes, and less intensive moments. Naturally, frame rates between those two kinds of scenes will be different: less intensive scenes will have higher frame rates and more intensive scenes will have lower frame rates. There will also be scenes where you are bottlenecked by the processor or memory subsystem rather than the graphics card. In these scenes the graphics card won't be able to reach its peak potential and results are lower than they could be.
So, in that regard, graphical benchmarks using timedemos or replays are similar to synthetic benchmarks - they isolate the load on a specific piece of hardware, in this case the graphics card, even though they are real-world applications or games. As a natural progression, reviewers and manufacturers are now using more 'real world' performance testing, with games in particular. The idea is to monitor frame rate during actual gameplay rather than replays or timedemos, so that the results are as close as possible to the frame rates users and gamers can expect in actual gameplay. To keep repeatability and reproducibility, each gameplay session is played in a specific manner so the outcome is similar each time. Results from repeated runs are then averaged as a proxy for that scenario, since the variation between runs is higher than what can be expected from timedemos.
Today we're going to share our experiences with both of these benchmark tools - gameplay testing sessions and replay / timedemo runs. This article is not meant as an argument for or against either of them, but rather as an 'adventure' in exploring the peculiarities of both tools.
Problems
Of the three games we're going to test today (F.E.A.R, Quake 4 and Serious Sam II), we have so far relied pretty much solely on gameplay testing sessions for Quake 4. While this method of testing does have higher system overhead than a timedemo, the graphics rendering is more accurate. Quake 4's timedemo is 'broken', meaning some shadows are not displayed, some textures are missing and the wrong fonts are used on OSDs like the ones you see on elevator / lift control panels and switches. Here are some examples:
Left: a misplaced OSD texture.
Right: notice there are shadows in the gameplay screenshot and none in the timedemo screenshot.
Lately we have found that the results we're getting in Quake 4 don't really seem to represent the frame rates we've experienced throughout the entire game. We did choose the Data Processing Terminal level (the 2nd one, with the tower guardian) because it scales pretty well with changes in resolution and graphical features. The same can also be said for Serious Sam II. With powerful graphics cards such as the GeForce 7900 and Radeon X1900 series, we have to enable AA and AF to make the Greendale demo scale with changes in resolution. We hope that by finding a level in Serious Sam II that scales better than the developer-supplied Greendale demo, we can use that new level for additional benchmarks in our performance evaluations.
Another problem in doing gameplay testing sessions is something that's a boon for gamers but a nightmare for reviewers - checkpoint saves. All three games have checkpoint saves in most of their levels. While they are helpful when playing the game, they often cause a stutter / a sharp drop in frame rate because the game needs to write data to the hard drive. Because of this, we will rely only on average frame rates for most of this article. Hopefully, we can find levels that don't have any checkpoints, or whose checkpoints can be circumvented in some way.
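To make the checkpoint problem concrete, here is a minimal sketch of how a single save stutter affects the two usual statistics. The per-second frame rate samples are hypothetical, made up purely for illustration, and `summarize` is just a helper name chosen for this example.

```python
from statistics import mean


def summarize(fps_samples):
    """Return the average and minimum of a list of per-second fps readings."""
    return {"average": mean(fps_samples), "minimum": min(fps_samples)}


# Hypothetical one-minute runs sampled once per second (not measured data).
smooth_run = [100] * 60
run_with_save = [100] * 59 + [20]  # one second disturbed by a checkpoint save

for name, run in [("smooth", smooth_run), ("with save", run_with_save)]:
    stats = summarize(run)
    print(f"{name}: average {stats['average']:.1f} fps, minimum {stats['minimum']} fps")
```

In this made-up case the average barely moves while the minimum collapses, which is why a level with checkpoint saves is still usable for average frame rates but not for minimum frame rates.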
The test setup and settings for the game are the same as our usual standard setup and settings, an Athlon 3500+ and a GeForce 7900GT running at 520 / 700 MHz (non reference core and memory clocks). Our focus is on graphical benchmarks, so we are trying to find levels that scale graphically. Here are the steps we've taken to narrow the choices down.
- We've decided to first test each level within each game under normal playing conditions at 1024 x 768, 1280 x 1024 (or 1280 x 960) and 1600 x 1200, without AA and AF. Differences in the results between those resolutions may indicate that a level is more graphically intensive than other levels. Of course, the differences may also simply be variations between runs.
- After selecting the top five levels that scale best with resolution, we will then focus on minimizing variations between runs.
- Afterwards, we compare the results from the game's internal benchmarking tool to the results we got from gameplay. Hopefully by this time we've narrowed the choice to a single level. That would make testing so much easier and quicker for the final test.
- The final test will be to see how the levels we've chosen scale in resolution, with and without AA and AF compared to our usual benchmarks.
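The first two steps can be sketched in a few lines. This is only an illustration: `performance_hit` and `rank_by_hit` are hypothetical helper names, and the three sample rows are taken from the F.E.A.R results in this article. The hit is assumed to be the percentage drop relative to the 1024 x 768 frame rate, which matches the percentages in the tables.

```python
def performance_hit(base_fps, fps):
    """Percentage drop from base_fps to fps; a negative value means fps went up."""
    return (base_fps - fps) / base_fps * 100


def rank_by_hit(results, top_n=5):
    """Sort levels by their hit at the highest resolution, largest first.

    results maps a level name to (fps at 1024, fps at 1280, fps at 1600).
    """
    def hit_at_1600(level):
        fps_1024, _, fps_1600 = results[level]
        return performance_hit(fps_1024, fps_1600)

    return sorted(results, key=hit_at_1600, reverse=True)[:top_n]


# Three rows from the F.E.A.R table, used as sample input.
sample = {
    "Point of origin": (124.27, 106.23, 86.55),
    "Bypass": (143.21, 114.22, 83.01),
    "Afterimage": (121.27, 89.63, 72.01),
}

print(rank_by_hit(sample, top_n=2))
```

Ranking purely by the number is only a starting point; as the later sections show, levels can still be dropped by hand when they don't represent typical gameplay.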
F.E.A.R
To get an idea of what frame rates we could expect from this game, we decided to play the game from start to finish three times, each time at a different resolution. Please note that these are raw numbers and they will likely be different with each run. We're not focusing on repeatability or reproducibility at this point; instead we're trying to get a general idea of which of F.E.A.R's levels are most responsive to resolution changes.
The number of pixels at 1280 x 960 and 1600 x 1200 is roughly 1.6 and 2.4 times the number of pixels at 1024 x 768. So, in a 100 percent graphically limited situation where frame rate falls in proportion to pixel count, we would expect the performance hit to be around 35 and 60 percent, respectively. Of course, that's a raw estimate and we will never be fully 100 percent graphically limited in a game. Now, let's look at those numbers.
| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| Point of origin | 124.27 | 106.23 | 86.55 | 14.51% | 30.35% |
| First encounter | 120.01 | 100.3 | 71.65 | 16.42% | 40.30% |
| Infiltration | 129.53 | 108.44 | 81.77 | 16.29% | 36.87% |
| Heavy resistance | 116.22 | 94.96 | 71.96 | 18.30% | 38.09% |
| Bad water | 129.24 | 101.69 | 78.05 | 21.32% | 39.61% |
| Exeunt omnes | 117.71 | 102.06 | 78.84 | 13.29% | 33.02% |
| LZ is hot | 119.57 | 101.69 | 75.06 | 14.95% | 37.22% |
| Watchers | 121.74 | 100.11 | 73.4 | 17.77% | 39.71% |
| Bishop | 116.2 | 97.52 | 70.98 | 16.07% | 38.91% |
| Blindside | 129.08 | 105.44 | 81.61 | 18.31% | 36.78% |
| Sayonara, sucker | 139.8 | 110.58 | 84.02 | 20.90% | 39.90% |
| Unauthorized personnel | 120.16 | 100.25 | 73.31 | 16.56% | 38.99% |
| Afterimage | 121.27 | 89.63 | 72.01 | 26.09% | 40.62% |
| Alice Wade | 110.03 | 93.04 | 68.32 | 15.43% | 37.91% |
| Flight | 114.61 | 92.93 | 73.38 | 18.92% | 35.97% |
| Urban decay | 106.91 | 92.64 | 64.61 | 13.35% | 39.57% |
| Point of entry | 97.09 | 82.65 | 63.48 | 14.87% | 34.62% |
| Lapdog | 119.29 | 97.77 | 70.84 | 18.04% | 40.62% |
| Bypass | 143.21 | 114.22 | 83.01 | 20.24% | 42.04% |
| The Vault | 122.46 | 102.13 | 66.34 | 16.60% | 45.83% |
| Ground Zero | 125.41 | 94.46 | 68.44 | 24.68% | 45.43% |
| Aftermath | 106.19 | 87.18 | 64.89 | 17.90% | 38.89% |
| Average (entire game) | 120.45 | 98.91 | 73.75 | 17.76% | 38.69% |
all numbers are in frame rates (except for percentages)
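The pixel-count estimate given before the table can be checked with a short calculation. This sketch assumes frame rate falls in exact inverse proportion to the number of pixels rendered, which is an idealization no real game will match; the function names are made up for this example.

```python
def pixel_ratio(base, target):
    """Ratio of pixel counts between two (width, height) resolutions."""
    return (target[0] * target[1]) / (base[0] * base[1])


def ideal_hit_percent(base, target):
    """Frame rate drop in percent if fps scaled inversely with pixel count."""
    return (1 - 1 / pixel_ratio(base, target)) * 100


base = (1024, 768)
for target in [(1280, 960), (1600, 1200)]:
    ratio = pixel_ratio(base, target)
    hit = ideal_hit_percent(base, target)
    print(f"{target[0]} x {target[1]}: {ratio:.2f}x the pixels, ideal hit {hit:.1f}%")
```

Comparing these idealized figures with the measured hits in the table gives a rough feel for how far a level is from being purely graphics limited.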
F.E.A.R reacts nicely to resolution changes, which means you will most likely be seeing lots of graphically limited scenes throughout the entire game. Although we could prove this by benchmarking the entire game every time, that would take too much time and effort. To simplify things a bit, we took five of the levels with the biggest performance hits at 1600 x 1200 and 1280 x 960 and put them in the table below. Undoubtedly, you'll notice that some other levels such as Ground Zero and The Vault have higher performance hits than the ones we picked, but we think these levels don't really represent what we expect gameplay to be like throughout the entire game.
| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| Bypass | 143.21 | 114.22 | 83.01 | 20.24% | 42.04% |
| Afterimage | 121.27 | 89.63 | 72.01 | 26.09% | 40.62% |
| Lapdog | 119.29 | 97.77 | 70.84 | 18.04% | 40.62% |
| First encounter | 120.01 | 100.3 | 71.65 | 16.42% | 40.30% |
| Sayonara, sucker | 139.8 | 110.58 | 84.02 | 20.90% | 39.90% |
| Average | 128.72 | 102.5 | 76.3 | 20.34% | 40.69% |
all numbers are in frame rates (except for percentages)
Notice how the average fps for the entire game is not that far off from the average fps of the top five levels in the table above. So, rest assured - it's very likely the frame rates we're going to see in these levels will be fairly close to what you would expect the entire time you play F.E.A.R.
Now that we've picked levels that are more graphically intensive than others (or at the very least most responsive to higher resolutions), let's see what frame rates we can expect from different runs. Obviously, we want as little variation between runs as possible. In order to do that, we played through each level in a controlled way, trying to get the same run each time. We also tried to keep the firefight scenes as graphically intensive as possible - that way we end up with a 'worst case' scenario. However, do keep in mind F.E.A.R is a very fluid game, so every run will be different in some way.
To confirm repeatability between runs, we ran through the top five levels we've picked at 1024 x 768. Any higher and the graphics subsystem will likely come into play. At 1024 x 768 without AA and AF, it's very likely we'll be more system limited, and from our experience normal variations often occur because of system limitations rather than anything graphics related.
These results are taken at 1024 x 768, without AA and AF.
| Level | 1st | 2nd | 3rd | Average | Variations | Variations (%) |
| Bypass | 146.26 | 145 | 148.86 | 146.71 | 1.97 | 1.34% |
| Afterimage | 104.29 | 103.29 | 104.13 | 103.9 | 0.54 | 0.52% |
| Lapdog | 112.14 | 115.73 | 117.44 | 115.1 | 2.71 | 2.35% |
| First encounter | 108.16 | 106.46 | 105.91 | 106.84 | 1.17 | 1.10% |
| Sayonara, sucker | 128.54 | 126.72 | 127.44 | 127.57 | 0.91 | 0.72% |
all numbers are in frame rates (except for percentages)
Wow. Those variations are pretty small - all of them are under 5 percent. In all honesty, we could use any of these levels as a proxy for the whole game. Of course, the less variation the better, and in that regard the Afterimage and Sayonara, sucker levels really stand out.
One thing to note though - if you compare the frame rates in the previous table with the ones in this table, you'll see differences, particularly for Afterimage and First Encounter. Now, don't worry. We expected this, since initially we were more focused on playing the game than on trying to get similar results / outcomes. And we did say we tried to stress the card as much as possible with more graphically intensive firefights (explosions, smoke, particle effects). Later on, we will see just what kind of frame rates we can expect with resolution changes. For now, let's move on.
Next comes the million dollar question: does the F.E.A.R Performance Test really represent gameplay? Are its frame rates similar to what you will see while you're playing the game? We took the average of the three runs of each level from the previous table and compared them to what we got from the F.E.A.R Performance Test. Since we're using FRAPS in gameplay testing sessions, we also threw in FRAPS results for the F.E.A.R Performance Test, just to see how much of a difference there is between them, if any.
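For readers who want to reproduce the variation figures, the numbers in the table appear consistent with the sample standard deviation of the three runs, expressed both in fps and as a percentage of the average. That reading is inferred from the numbers rather than stated anywhere, so treat this as a sketch under that assumption; `run_variation` is a name made up for this example.

```python
from statistics import mean, stdev


def run_variation(runs):
    """Return (average, variation in fps, variation in percent) for repeated runs."""
    avg = mean(runs)
    spread = stdev(runs)  # sample standard deviation (divides by n - 1)
    return avg, spread, spread / avg * 100


bypass_runs = [146.26, 145, 148.86]  # the three Bypass runs from the table
avg, spread, pct = run_variation(bypass_runs)
print(f"average {avg:.2f} fps, variation {spread:.2f} fps ({pct:.2f}%)")
```

With only three runs per level the standard deviation is a coarse measure, but it is enough to flag levels that are hard to play the same way twice.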
These results are taken at 1024 x 768, without AA and AF.
| Level | Gameplay | Performance Test (FRAPS) | Performance Test (internal) | Difference | Difference (%) |
| Bypass | 146.71 | 103.18 | 102.67 | -43.79 | -29.85% |
| Afterimage | 103.9 | 103.18 | 102.67 | -0.98 | -0.95% |
| Lapdog | 115.1 | 103.18 | 102.67 | -12.18 | -10.58% |
| First encounter | 106.84 | 103.18 | 102.67 | -3.92 | -3.67% |
| Sayonara, sucker | 127.57 | 103.18 | 102.67 | -24.65 | -19.32% |
all numbers are in frame rates (except for percentages)
We can see that in some levels, average frame rates in gameplay testing sessions can be higher than in the Performance Test, even in levels that we think are graphically intensive. If you look at Afterimage and First Encounter, the frame rates are practically the same as in the F.E.A.R Performance Test. The other levels are much higher. If we have to hazard a guess, we think that's because these two levels have more firefights than the others, making them more graphically intensive than the game as a whole - a 'worst case' scenario where frame rates will be the lowest compared to the other parts of the game.
From these five levels, it looks like we will be using Afterimage for our gameplay testing. Frame rates in Sayonara, sucker are much too different (much higher - too much walking around in empty rooms, corridors etc) and First Encounter's results have higher variation between runs than Afterimage's (though it's still only 1.1 percent). We did see lower frame rates in the Point of Entry level, but its performance hit at 1600 x 1200 is lower than in these two levels, so we will stick with Afterimage for gameplay testing.
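The Difference columns in the table appear to be the mean of the two Performance Test readings (FRAPS and internal) minus the gameplay average, with the percentage taken relative to gameplay. That is an inference from the numbers, and the printed percentages can be off by a hundredth or so depending on where rounding happens. A minimal sketch under that assumption, with a made-up function name:

```python
def benchmark_gap(gameplay_fps, fraps_fps, internal_fps):
    """Return (difference in fps, difference in percent) of benchmark vs gameplay.

    The benchmark figure is taken as the mean of the FRAPS and internal readings.
    A negative result means the benchmark ran slower than actual gameplay.
    """
    benchmark_fps = (fraps_fps + internal_fps) / 2
    diff = benchmark_fps - gameplay_fps
    return diff, diff / gameplay_fps * 100


# Lapdog row: gameplay average vs the F.E.A.R Performance Test readings.
diff, pct = benchmark_gap(115.1, 103.18, 102.67)
print(f"difference {diff:.2f} fps ({pct:.2f}%)")
```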
There are also subjective reasons why we chose Afterimage. For one, there is only one savegame checkpoint in this level, so it will be much easier to get minimum frame rates with FRAPS, without the interference of a save. Firefights are pretty close to each other - three times in both parts (with a more frantic second firefight in the second part) and that means less let up on the graphics. The entire level is also pretty short, something like 6 minutes, which is a time saver when doing performance evaluations.
Quake 4
Just like we did with F.E.A.R, we played through the whole game three times at different resolutions, recording frame rates for each level. Compared to F.E.A.R, Quake 4 is a much more difficult game, and no, we don't mean the AI. We mean it's easier to get killed by falling off a ledge and into a void. Playing with the frame rate cap off isn't going to make it any easier. At 1024 x 768, the game runs like a speed demon - even the pistol will behave like a machine gun at times. Thankfully, we managed to finish testing just fine and here are the results.
| Level | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| Air Defense Bunker | 91.76 | 93.36 | 82.82 | -1.75% | 9.74% |
| Air Defense Trenches | 74.11 | 74.83 | 81.56 | -0.98% | -10.06% |
| Hangar Perimeter | 89.5 | 86.01 | 72.4 | 3.89% | 19.11% |
| Interior Hangar | 98.48 | 92.06 | 82.81 | 6.53% | 15.92% |
| MCC Landing Site | 76.83 | 76.59 | 65.27 | 0.31% | 15.05% |
| Op: Advantage | 86.14 | 86.64 | 72.36 | -0.58% | 16.00% |
| Canyon | 76.22 | 73.93 | 75.28 | 3.01% | 1.23% |
| Perimeter Defense Station | 90.54 | 89.48 | 72.86 | 1.17% | 19.53% |
| Aqueducts | 74.71 | 74.88 | 77.35 | -0.23% | -3.53% |
| Aqueducts Annex | 83.23 | 81.01 | 76.52 | 2.67% | 8.07% |
| Nexus Hub Tunnels | 87.31 | 83.62 | 78.04 | 4.22% | 10.61% |
| Nexus Hub | 89.67 | 86.58 | 78.04 | 3.44% | 12.97% |
| Strogg Med Facilities | 97.54 | 92.69 | 90.54 | 4.97% | 7.17% |
| Construction Zone | 91.27 | 86.51 | 76.38 | 5.22% | 16.32% |
| Dispersal Facility | 98.69 | 95.5 | 93.21 | 3.23% | 5.55% |
| Recomposition Center | 117.38 | 117.56 | 107.52 | -0.16% | 8.39% |
| Putrification Center | 92.01 | 91.34 | 87.16 | 0.73% | 5.28% |
| Waste Processing Facility | 106.99 | 102.02 | 102.65 | 4.65% | 4.06% |
| Op: Last Hope | 94.71 | 93.38 | 83.51 | 1.41% | 11.82% |
| Data Storage Terminal | 66.8 | 67.76 | 69.45 | -1.44% | -3.98% |
| Data Storage Security | 94.88 | 93.39 | 82.13 | 1.57% | 13.44% |
| Data Storage Terminal | 76.85 | 75.39 | 77.84 | 1.90% | -1.28% |
| Tram Hub Station | 86.73 | 76.84 | 91.65 | 11.41% | -5.67% |
| Tram Rail | 73.16 | 70.38 | 68.6 | 3.80% | 6.23% |
| Data Processing Terminal | 88.17 | 71.02 | 85.28 | 19.45% | 3.27% |
| Data Processing Security | 85.65 | 85.28 | 80.47 | 0.43% | 6.05% |
| Data Processing Terminal | 119.79 | 119.57 | 104.22 | 0.19% | 13.00% |
| Data Networking Terminal | 79.83 | 79.9 | 73.22 | -0.09% | 8.27% |
| Data Networking Security | 89.38 | 92.11 | 82 | -3.05% | 8.26% |
| Nexus Core | 97.94 | 96.5 | 91.73 | 1.47% | 6.34% |
| The Nexus | 82.13 | 79.84 | 69.34 | 2.79% | 15.58% |
| Average (entire game) | 88.98 | 86.64 | 81.68 | 2.59% | 7.83% |
all numbers are in frame rates (except for percentages)
Surprisingly, Quake 4 is generally less responsive to resolution changes than F.E.A.R. Either that or we're system limited, and we're more inclined to think it's the latter. The average frame rate for the entire game is pretty similar whether you're playing at 1024 x 768, 1280 x 1024 or 1600 x 1200. That's why it's important to pick levels that are more graphically intensive than others if you want to use this game as a graphics card benchmark - below you'll see the top five levels we've chosen.
| Level | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| Perimeter Defense Station | 90.54 | 89.48 | 72.86 | 1.17% | 19.76% |
| Hangar Perimeter | 89.5 | 86.01 | 72.4 | 3.89% | 19.88% |
| Interior Hangar | 98.48 | 92.06 | 82.81 | 6.53% | 15.92% |
| The Nexus | 82.13 | 79.84 | 69.34 | 2.79% | 16.03% |
| MCC Landing Site | 76.83 | 76.59 | 65.27 | 0.31% | 15.05% |
| Average | 87.5 | 84.8 | 72.53 | 2.94% | 17.33% |
all numbers are in frame rates (except for percentages)
Ah, that's more like it. Do note that the frame rates of these levels in the table above are generally lower than what you see throughout the entire game, though they're still fairly close. The performance hit of the various levels at 1280 x 1024 ranges from nothing to slightly higher, and at 1600 x 1200 the hits are pretty much similar. The Interior Hangar level stands out as the one with the largest hit at 1280 x 1024, but The Nexus's performance hit is pretty close to the average. The Nexus also has the second lowest frame rate of the five, the MCC Landing Site being the lowest. However, we still need to confirm these results with more testing - are they repeatable? Let's look at the next table.
These results are taken at 1024 x 768, without AA and AF.
| Level | 1st | 2nd | 3rd | Average | Variations | Variations (%) |
| Perimeter Defense Station | 89.85 | 89.17 | 89.78 | 89.6 | 0.37 | 0.41% |
| Hangar Perimeter | 85.79 | 84.48 | 83.2 | 84.49 | 1.3 | 1.54% |
| Interior Hangar | 90.98 | 93.23 | 90.39 | 91.53 | 1.5 | 1.63% |
| The Nexus | 79.41 | 75.29 | 77.39 | 77.36 | 2.06 | 2.66% |
| MCC Landing Site | 79.11 | 72.29 | 70.92 | 74.11 | 4.39 | 5.92% |
all numbers are in frame rates (except for percentages)
Only the Perimeter Defense Station results are pretty close to our initial results. The other levels are roughly 3 to 7 frames per second lower. Remember, we wanted a worst case scenario that really pushes the graphics card. The good news is the results from these levels are repeatable - only the MCC Landing Site level has slightly larger variations between runs. The Nexus level remains the one with the second lowest average frame rate.
Alright, now we're going to see whether or not there are differences between frame rate in actual gameplay and a timedemo run. Since we're measuring frame rate in gameplay testing sessions, we are playing the game with all the bells and whistles on - AI, sound, physics etc. Differences with timedemo runs are expected, but it sure is good to see just a small difference between gameplay testing sessions and timedemo results. In the table below, you'll see the average of three runs compared to a timedemo run for each of the levels we've picked. Just to make sure FRAPS is working correctly, we've included results from both Quake 4 internal timedemo statistics and FRAPS.
These results are taken at 1024 x 768, without AA and AF.
| Level | Gameplay | Timedemo (FRAPS) | Timedemo (internal) | Difference | Difference (%) |
| Perimeter Defense Station | 89.6 | 104.31 | 104.2 | 14.66 | 16.36% |
| Hangar Perimeter | 84.49 | 91.37 | 91.5 | 6.95 | 8.22% |
| Interior Hangar | 91.53 | 88.36 | 88.45 | -3.13 | -3.42% |
| The Nexus | 77.36 | 77.72 | 77.5 | 0.25 | 0.32% |
| MCC Landing Site | 74.11 | 68.48 | 68 | -5.87 | -7.92% |
all numbers are in frame rates (except for percentages)
Well, that's - weird. We usually expect timedemo results to be higher than gameplay testing session results, given the absence of sound (and some texture errors). But we sure didn't expect timedemo results to actually be lower. From the looks of it, the only level that stays fairly close is The Nexus. The results from the other levels are too different (or too strange) compared to a timedemo run. It's probably safe to say that you shouldn't rely only on timedemo results as a measure of actual gameplay performance with Quake 4. Timedemo results can actually be very different from frame rates in actual gameplay.
After more detailed testing and some additional research, this weird behavior with the Quake 4 timedemo turns out not to be weird at all. Timedemos don't run AI, sound or physics routines; they're straight graphical instructions for the graphics card to render the game's graphics. That means a timedemo is very close to a pure graphics stress test, though processor, memory and bus speeds / bandwidth still have some influence on the results. With the burden of AI, sound and physics taken off the processor, it can focus most of its processing power on supplying the graphics card, which in turn tries to render the game as fast as it can - hence the word 'timedemo'. In our case, higher timedemo results in certain levels indicate that we are system limited in those levels during gameplay. Conversely, lower results may indicate we are graphically limited (pay attention to the word 'may').
So, out of the five levels we've picked, the choice comes down to two - Interior Hangar and The Nexus. We've decided to choose The Nexus for gameplay testing sessions - the primary reason is that the results indicate this level puts close to equal load on both the system and the graphics subsystem. There are other points in favor of The Nexus level - it lacks a checkpoint save, so there are no hard disk writes to cause a stutter or a sharp drop in frame rate, which makes it ideal for getting minimum frame rates. After all, we believe a lower minimum fps is more of a concern than a lower average fps. Generally speaking, the average fps from The Nexus is pretty close to the lowest frame rates we've seen throughout the entire game, making it quite a perfect fit for a worst case scenario.
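That rule of thumb can be written down as a small helper. This is only a sketch of the reasoning above, not a measurement method: the function name, the labels and the 3 percent tolerance are all assumptions chosen for illustration, and the 'may' in the graphics limited case still applies.

```python
def classify_level(gameplay_fps, timedemo_fps, tolerance_pct=3.0):
    """Guess where a level is limited by comparing timedemo and gameplay fps.

    tolerance_pct is an arbitrary band inside which the two results are
    treated as equal; it is an assumption, not a value from the article.
    """
    diff_pct = (timedemo_fps - gameplay_fps) / gameplay_fps * 100
    if diff_pct > tolerance_pct:
        return "likely system limited"
    if diff_pct < -tolerance_pct:
        return "possibly graphics limited"
    return "roughly balanced"


# Gameplay averages and FRAPS timedemo results from the Quake 4 table.
levels = {
    "Perimeter Defense Station": (89.6, 104.31),
    "The Nexus": (77.36, 77.72),
    "MCC Landing Site": (74.11, 68.48),
}
for name, (gameplay, timedemo) in levels.items():
    print(f"{name}: {classify_level(gameplay, timedemo)}")
```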
Serious Sam II
This game is quite possibly the longest of the three - 41 levels in total. There are some pretty funny, albeit dry, jokes in it, like "These rocks can shake, but I won't Quake." or even "I love collecting big guns. They're so Unreal." Hmm, what's the matter, Croteam? No Half-Life jokes? The last level, Mental Institution, was quite probably the longest level we have ever played in an FPS. Even the guys at Croteam put an estimated playtime of 1 hour on this level. Wonder what they will call the sequel: Serious Sam II, well, 2? I know - Serious Sam II: Double the Fun!!! Oh, well - here are the results of many long hours of playing Serious Sam II.
| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| Jungle | 92.26 | 91.01 | 82.37 | 1.35% | 10.72% |
| Riverdance | 89.28 | 84.3 | 82.59 | 5.58% | 7.49% |
| M'Keke Village | 84.05 | 92.44 | 82.99 | -9.98% | 1.26% |
| Road to Ursul | 70.58 | 75.21 | 67.91 | -6.56% | 3.78% |
| Ursul Suburbs | 83.98 | 84.99 | 80.84 | -1.20% | 3.74% |
| Kukulele Prison | 80.78 | 81.31 | 72.89 | -0.66% | 9.76% |
| Ursul Gardens | 75.77 | 78.25 | 77.64 | -3.28% | -2.47% |
| Kwongo | 104.6 | 89.58 | 81.84 | 14.36% | 21.76% |
| Deadwood | 86.62 | 86.06 | 84.88 | 0.65% | 2.01% |
| Forsaken Compound | 74.92 | 75.81 | 81.69 | -1.19% | -9.04% |
| Branchester | 81.44 | 78.36 | 78.31 | 3.78% | 3.84% |
| Woodstock | 95.39 | 96.54 | 95.27 | -1.21% | 0.12% |
| Giant Junkyard | 56.41 | 60.46 | 62.45 | -7.18% | -10.69% |
| Zum-Zum | 108.23 | 95.88 | 128.07 | 11.41% | -18.34% |
| Hong-Pong | 83.67 | 79.72 | 68.5 | 4.72% | 18.13% |
| Wheels of Fortune | 65.19 | 64.18 | 62.74 | 1.55% | 3.75% |
| Chanolin Chambers | 105.79 | 108.7 | 97.54 | -2.76% | 7.79% |
| Chanolin Temple | 88.7 | 83.77 | 74.2 | 5.55% | 16.34% |
| Prince Chan | 99.5 | 90.52 | 91.94 | 9.02% | 7.60% |
| Unreal Wasteland | 113.9 | 115.69 | 113.48 | -1.57% | 0.37% |
| Doomed Canyon | 135.17 | 129.42 | 129.16 | 4.26% | 4.45% |
| Boneyard | 132.35 | 123.6 | 118.57 | 6.61% | 10.41% |
| Count Kleerofski | 159.96 | 98.21 | 80.21 | 38.60% | 49.86% |
| Greendale | 81.5 | 81.87 | 83.09 | -0.46% | -1.95% |
| Coast to Cost | 80.63 | 85 | 88.97 | -5.42% | -10.35% |
| Kingsburg | 90.83 | 91.75 | 91.98 | -1.01% | -1.26% |
| Royal Sewers | 141.35 | 140.95 | 138.95 | 0.28% | 1.70% |
| Castle of Rock | 108.3 | 109.12 | 97.33 | -0.75% | 10.13% |
| Floaterra | 86.17 | 95.8 | 92.63 | -11.18% | -7.50% |
| Cecil the Dragon | 170.89 | 179.86 | 162.56 | -5.25% | 4.87% |
| Area 5100 | 96.6 | 101.97 | 97.39 | -5.56% | -0.82% |
| Command Center | 110.49 | 95.93 | 100.66 | 13.18% | 8.90% |
| Freezepad | 104.03 | 106.38 | 9.54 | -2.26% | 90.83% |
| Frostpost | 78.14 | 83.63 | 80.9 | -7.03% | -3.54% |
| Hugo the Robot | 124.22 | 119.96 | 109.33 | 3.43% | 11.98% |
| Siriuspolis Uptown | 74.86 | 83.21 | 60.67 | -11.15% | 18.95% |
| Shield Generator | 59.74 | 60.39 | 59.52 | -1.08% | 0.38% |
| Be Quick of Be Dead | 97.31 | 93.71 | 97.14 | 3.70% | 0.17% |
| Welcome to the Jungle | 87.51 | 95.28 | 83.55 | -8.88% | 4.53% |
| Jump in the Fire | 125.03 | 106.64 | 98.95 | 14.71% | 20.86% |
| Siriuspolis Downtown | 78.64 | 83.57 | 78.87 | -6.27% | -0.29% |
| Mental Institution | 103.81 | 96.63 | 85.52 | 6.91% | 17.61% |
| Average (entire game) | 96.87 | 94.66 | 88.42 | 1.14% | 7.33% |
all numbers are in frame rates (except for percentages)
The entire game seems to be quite system limited. It would've been a very different picture with a faster processor or a slower graphics card, but we didn't have either of them on hand so these results will have to do. The average performance hit for the entire game at 1600 x 1200 is just around 7 percent - that's just how system limited this game is in our test setup. Obviously, we have to pick the more graphically intensive levels, where the performance hit is substantially larger than that. We didn't pick the Boss levels (Kwongo, Count Kleerofski) since they're pretty small and not really your typical Serious Sam II level. You can see the top five levels below.
| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1024 to 1280 | Performance Hit 1024 to 1600 |
| Jump in the Fire | 125.03 | 106.64 | 98.95 | 14.71% | 20.86% |
| Siriuspolis Uptown | 74.86 | 83.21 | 60.67 | -11.15% | 18.95% |
| Hong-Pong | 83.67 | 79.72 | 68.5 | 4.72% | 18.13% |
| Mental Institution | 103.81 | 96.63 | 85.52 | 6.91% | 17.61% |
| Chanolin Temple | 88.7 | 83.77 | 74.2 | 5.55% | 16.34% |
| Average | 95.21 | 89.99 | 77.57 | 4.15% | 18.38% |
all numbers are in frame rates (except for percentages)
Average frame rates do seem to differ from the entire game's average, by anywhere from 6 to 30 fps - a very big difference. The performance hits seem to indicate these levels are more graphically intensive than others, but would that be enough? Out of these levels, only Mental Institution has an average fps that's close to what we've seen throughout the entire game. Personally, we would rather not use this level - it's just too damn long. From a performance penalty perspective, the Hong-Pong level is fairly close to the average performance penalty, followed by Chanolin Temple. Now, let's see if these numbers are repeatable.
These results are taken at 1024 x 768, without AA and AF.
| Level | 1st | 2nd | 3rd | Average | Variations | Variations (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Jump in the Fire | 90.75 | 92.95 | 91.86 | 91.86 | 1.1 | 1.20% |
| Siriuspolis Uptown | 71.21 | 67.78 | 66.42 | 68.47 | 2.47 | 3.61% |
| Hong-Pong | 73.6 | 74.47 | 73.04 | 73.7 | 0.72 | 0.98% |
| Mental Institution | 100.68 | 99.47 | 100.43 | 100.2 | 0.64 | 0.64% |
| Chanolin Temple | 87.88 | 85.47 | 82.79 | 85.38 | 2.55 | 2.98% |
all numbers are in frame rates (except for percentages)
We tried to play these levels in a similar manner each time and by doing so, we've managed to keep variations pretty low. However, we can also see that actual frame rates are much lower than in our initial run. Only Mental Institution's and Chanolin Temple's results are fairly close to our initial results. However, Chanolin Temple seems to have a slightly higher variation between runs than Hong-Pong and Mental Institution. For our sanity's sake, let's hope it will be either Chanolin Temple or Hong-Pong, preferably Hong-Pong since it's much shorter and reacts much like other levels to resolution changes.
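A minimal sketch of how the run-to-run variation can be quantified, assuming the "Variations" column is the sample standard deviation of the three runs (the numbers in the table match that, but it is our inference). The helper name `run_variation` is hypothetical.

```python
from statistics import mean, stdev


def run_variation(fps_runs):
    """Return (average fps, variation in fps, variation in percent).

    Variation is taken as the sample standard deviation of the runs,
    which is an assumption about how the table was computed.
    """
    average = mean(fps_runs)
    spread = stdev(fps_runs)
    return average, spread, spread / average * 100.0


# Three manual runs per level at 1024 x 768, taken from the table.
runs = {
    "Jump in the Fire": [90.75, 92.95, 91.86],
    "Siriuspolis Uptown": [71.21, 67.78, 66.42],
    "Hong-Pong": [73.6, 74.47, 73.04],
    "Mental Institution": [100.68, 99.47, 100.43],
    "Chanolin Temple": [87.88, 85.47, 82.79],
}

for level, fps_runs in runs.items():
    average, spread, percent = run_variation(fps_runs)
    print(f"{level}: {average:.2f} fps, +/- {spread:.2f} fps ({percent:.2f}%)")
```

Three runs is a very small sample, so the percentages are best read as a rough indication of repeatability rather than a precise error bar.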
OK. Time to see if we could rely on replays instead of playing the game by hand. Will frame rates from a replay be indicative of actual frame rates in gameplay? Serious Sam II doesn't have a timedemo, but it does have a replay feature that can be used as a benchmark tool with the proper console command. We recorded a replay for each level, then played that replay back, measuring performance with both Serious Sam II's internal benchmarking tool and FRAPS. We did this just to make sure FRAPS is recording and reporting accurately. Here are the results for each level.
These results are taken at 1024 x 768, without AA and AF.
| Level | Gameplay | FRAPS benchmark | internal | Difference | Difference (%) |
| --- | --- | --- | --- | --- | --- |
| Jump in the Fire | 91.86 | 86.84 | 87.8 | -4.53 | -4.94% |
| Siriuspolis Uptown | 68.47 | 56.27 | 56.75 | -11.96 | -17.47% |
| Hong-Pong | 73.7 | 68.58 | 69.55 | -4.64 | -6.29% |
| Mental Institution | 100.2 | 90.36 | 91.1 | -9.47 | -9.45% |
| Chanolin Temple | 85.38 | 72.54 | 73.45 | -12.38 | -14.50% |
all numbers are in frame rates (except for percentages)
Ouch. Another strange behavior, but thankfully more predictable this time. It looks like replay results are lower than gameplay results, which makes them a poor representation of frame rates in actual gameplay. The difference between replays and gameplay testing sessions can be significant, so it's definitely not just normal variation between runs. Remember, variations between runs are fairly small with these levels (0.64 - 3.61 percent). In all honesty, our options are now limited to two levels - Jump in the Fire and Hong-Pong - the two levels that behave similarly in replays and gameplay testing sessions. Out of the two, we pick Jump in the Fire because it's shorter.
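The replay comparison above can be sketched as follows. We are assuming the "Difference" column compares the gameplay average against the mean of the FRAPS and internal replay readings, which is what the table figures are consistent with; `replay_difference` is a hypothetical helper name.

```python
def replay_difference(gameplay_fps, fraps_fps, internal_fps):
    """Return (difference in fps, difference in percent) of replay vs gameplay.

    Assumption: the replay figure is the mean of the FRAPS reading and the
    game's internal benchmark reading. Negative means the replay is slower.
    """
    replay_fps = (fraps_fps + internal_fps) / 2.0
    difference = replay_fps - gameplay_fps
    return difference, difference / gameplay_fps * 100.0


# (gameplay, FRAPS, internal) averages at 1024 x 768, taken from the table.
results = {
    "Jump in the Fire": (91.86, 86.84, 87.8),
    "Siriuspolis Uptown": (68.47, 56.27, 56.75),
    "Hong-Pong": (73.7, 68.58, 69.55),
    "Mental Institution": (100.2, 90.36, 91.1),
    "Chanolin Temple": (85.38, 72.54, 73.45),
}

for level, (gameplay, fraps, internal) in results.items():
    diff, percent = replay_difference(gameplay, fraps, internal)
    print(f"{level}: {diff:.2f} fps ({percent:.2f}%)")
```

Comparing the percentage against the 0.64 - 3.61 percent run-to-run variation is what tells us the replay gap is a real effect and not noise.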
We've seen the results from three newer games in our benchmark suite and it's not pretty. Out of the three, only the F.E.A.R Performance Test offers results that are close enough to frame rates in actual gameplay. The Quake 4 timedemo is not only broken (misplaced textures, fonts), but also behaves rather erratically with some levels. The Serious Sam II replay incurs quite a significant overhead in some cases, offering results that are actually lower than frame rates in actual gameplay.
It's beginning to look like timedemos and replays cannot really represent what kind of frame rate you can expect from a game, even with a developer supplied demo / replay. However, that doesn't mean they're completely useless. They can still give us some idea of how a game will behave in graphically limited situations, just not the exact level of frame rates in actual gameplay. For more representative test results, actual gameplay testing sessions are a better tool. Now that we have our candidates for gameplay testing, let's see how they stand up in a real performance test.
Performance
AA and AF settings are applied in game whenever possible. In Quake 4, we enable both AA and AF from the console, with the command 'r_multisample' set to 4 and 'image_anisotropy' set to 16. For gameplay testing, we removed the frame rate cap by entering the command 'com_fixedtic 1'. V sync was disabled both from inside the game and in the driver panel. AA and AF are enabled from the settings menu for F.E.A.R and Serious Sam II. F.E.A.R settings are set to our usual test settings (all settings set to maximum / on except for soft shadows) and volumetric lights were turned on. Quake 4 was set to 'High Quality' while Serious Sam II was tested with our usual standard settings (maximum settings but with custom resolution, anti aliasing and anisotropic filtering settings).
The graphs are pretty self explanatory, but in case you can't see the text, they're arranged in groups of three colors: green for minimum fps, blue for average fps and red for maximum fps. The first group represents test results under default settings, and the second represents results with 4xAA and 16xAF enabled. These results are in frames per second.
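For readers who want to produce the same three figures from their own runs, here is a minimal sketch that reduces a list of per-second frame counts to minimum, average and maximum fps. The sample values and the name `summarize_fps` are hypothetical, and we are assuming one reading per second of play; this is not the exact procedure of any particular capture tool.

```python
def summarize_fps(frames_per_second):
    """Return (minimum, average, maximum) fps from per-second frame counts."""
    if not frames_per_second:
        raise ValueError("need at least one one-second sample")
    minimum = min(frames_per_second)
    maximum = max(frames_per_second)
    # Every sample covers one second, so the mean of the samples equals
    # total frames divided by total seconds.
    average = sum(frames_per_second) / len(frames_per_second)
    return minimum, average, maximum


# Hypothetical six-second capture with a dip during a firefight.
samples = [62, 58, 30, 45, 71, 66]
minimum, average, maximum = summarize_fps(samples)
print(f"min {minimum} fps, avg {average:.2f} fps, max {maximum} fps")
```

A single slow second sets the minimum, which is why events such as checkpoint saves can distort it far more than they distort the average.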
[Graphs: F.E.A.R at 1024 x 768, 1280 x 960 and 1600 x 1200, each showing Afterimage 1, Afterimage 2 and the FEAR Performance Test]
| Test | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Afterimage | 103.9 | 87.13 | 63.57 | 16.14% | 38.82% |
| FEAR Performance Test | 106.72 | 85.5 | 64.26 | 19.88% | 39.78% |
| Average (entire game) | 120.45 | 98.91 | 73.75 | 17.76% | 38.69% |
| Average (top 5 graphical levels) | 128.72 | 102.5 | 76.3 | 20.34% | 40.69% |

| Test | 1024 x 768 AA AF | 1280 x 960 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Afterimage | 73.23 | 52.61 | 36.67 | 28.16% | 49.93% |
| FEAR Performance Test | 72.41 | 53.35 | 37.22 | 26.32% | 48.60% |
all numbers are in frame rates (except for percentages)
By not including checkpoint saves in our gameplay testing runs, we were able to get a pretty repeatable minimum frame rate for the Afterimage level. Although the F.E.A.R Performance Test was pretty much on the mark with average fps, it's a slightly hit and miss affair on the minimum fps. Take a close look at the 1024 x 768, AA / AF results. In the second part of Afterimage, the second firefight was graphically intensive enough to pull the minimum down to 30 fps, while the F.E.A.R Performance Test 'only' dropped to 38 fps. That's quite a huge difference - 25 percent. We also saw a similar difference at 1600 x 1200 without AA and AF. To be fair, the F.E.A.R Performance Test was pretty close at 1024 x 768 and 1280 x 960 without AA and AF, so it's obviously good enough for quick benchmarks, but not for more detailed information.
On the subject of representing the entire game, the F.E.A.R Performance Test does a pretty good job of showing a worst case scenario. Yes, frame rates are lower than what you will be seeing throughout the entire game, but the performance penalty is similar. We didn't run the entire game with AA and AF, but we think the difference between gameplay testing sessions for the entire game and the F.E.A.R Performance Test should be minimal. There will be levels that have lower frame rates than Afterimage and the F.E.A.R Performance Test, so we think it would be wise to keep the minimum frame rate close to 40 fps, if possible. Looking back at our various reviews, we recommend a slightly higher clocked GeForce 7900GT / 7800GTX or Radeon X1900XTX / X1900XT and equivalent to play at our test settings at 1024 x 768, 4x AA and 16x AF. Higher resolutions are only recommended with faster graphics card(s).
[Graphs: Quake 4 at 1024 x 768, 1280 x 1024 and 1600 x 1200, each showing Data Processing Terminal and The Nexus in gameplay and timedemo runs]
| Test | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Data Processing Terminal Gameplay | 119.12 | 117.66 | 105.76 | 1.22% | 11.22% |
| Data Processing Terminal Timedemo | 109.96 | 111.43 | 107.89 | -1.34% | 1.89% |
| The Nexus Gameplay | 85.31 | 81.93 | 71.2 | 3.97% | 16.54% |
| The Nexus Timedemo | 80.65 | 78.98 | 67.11 | 2.07% | 16.78% |
| Average (entire game) | 88.98 | 86.64 | 81.68 | 2.59% | 7.83% |
| Average (top 5 graphical levels) | 87.5 | 84.8 | 72.53 | 2.94% | 17.33% |

| Test | 1024 x 768 AA AF | 1280 x 1024 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Data Processing Terminal Gameplay | 98.08 | 69.01 | 50.68 | 29.64% | 48.32% |
| Data Processing Terminal Timedemo | 106.46 | 82.51 | 61.45 | 22.49% | 42.28% |
| The Nexus Gameplay | 64.13 | 44.02 | 32.46 | 31.37% | 49.39% |
| The Nexus Timedemo | 70.18 | 56.34 | 42.15 | 19.72% | 39.94% |
all numbers are in frame rates (except for percentages)
Since these two levels do not have checkpoint saves, we were also able to get a pretty repeatable minimum frame rate for both Data Processing Terminal and The Nexus. We saw earlier how results from gameplay testing sessions and timedemo runs can be very different, and here that problem rears its ugly head again. At 1024 x 768 without AA / AF, we have a higher minimum fps and average fps in gameplay testing sessions, while with AA and AF we have a higher minimum fps and a lower average fps than timedemo runs in Data Processing Terminal. The same applies to The Nexus at 1024 x 768.
At 1280 x 1024 without AA and AF, gameplay testing session results are still generally higher than timedemo runs, but at 1600 x 1200 they're pretty much the same in Data Processing Terminal. With AA and AF, 1280 x 1024 and 1600 x 1200 behave differently - gameplay testing session results are higher at 1280 x 1024 and timedemo results are higher at 1600 x 1200. The Nexus is a different matter - timedemo results are slower without AA / AF and faster with AA / AF at both 1280 x 1024 and 1600 x 1200. We ran all these tests 3 times and while there are very small variations between runs (1 to 3 fps), the variations pretty much disappear as the graphical load gets heavier.
These facts lead us to an interesting conclusion: by comparing results from timedemo runs and gameplay testing sessions, we were able to pinpoint at what resolution the game (or a particular level) becomes graphically limited instead of system limited. In Data Processing Terminal, the crossover point is 1600 x 1200 without AA and AF. With AA and AF enabled, the crossover point happens a lot sooner for both levels - below 1024 x 768. The Nexus still remains system limited at 1600 x 1200 without AA and AF, but with AA and AF enabled it too becomes graphics limited at 1024 x 768. As a quick performance test, we think a gameplay testing session on the Data Processing Terminal is still valid as a graphical benchmark because the performance hit is still very high, though we have to admit the hit is higher than in timedemo runs, which means system limitations do add to the hit here. However, the frame rates we got from this method are closer to what you may actually see when you're playing the game - Quake 4 has a frame rate cap of 60 fps by default.
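The crossover reasoning can be sketched in code. The rule used here is our own reading of the comparison, not a formal definition: we treat a level as graphics limited at the first tested resolution where the gameplay average no longer exceeds the timedemo average. The name `crossover_resolution` is hypothetical.

```python
RESOLUTIONS = ("1024 x 768", "1280 x 1024", "1600 x 1200")


def crossover_resolution(gameplay_fps, timedemo_fps, resolutions=RESOLUTIONS):
    """Return the first resolution where gameplay fps <= timedemo fps.

    Returns None when gameplay stays ahead at every tested resolution,
    which we read as the level still being system limited.
    """
    for resolution, gameplay, timedemo in zip(resolutions, gameplay_fps, timedemo_fps):
        if gameplay <= timedemo:
            return resolution
    return None


# Average fps from the Quake 4 tables (gameplay, timedemo).
tests = {
    "Data Processing Terminal": ((119.12, 117.66, 105.76), (109.96, 111.43, 107.89)),
    "The Nexus": ((85.31, 81.93, 71.2), (80.65, 78.98, 67.11)),
    "Data Processing Terminal, AA AF": ((98.08, 69.01, 50.68), (106.46, 82.51, 61.45)),
    "The Nexus, AA AF": ((64.13, 44.02, 32.46), (70.18, 56.34, 42.15)),
}

for name, (gameplay, timedemo) in tests.items():
    print(f"{name}: {crossover_resolution(gameplay, timedemo)}")
```

A result at the lowest tested resolution only tells us the crossover happens at or below that point, which is why the AA and AF cases are described as crossing over below 1024 x 768.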
On the subject of representing the entire game (with a gameplay testing session), we think The Nexus is more representative than Data Processing Terminal. Frame rates are generally in line with what we saw through the entire game, particularly in levels that are graphically intensive, which makes it a good worst case scenario. The performance penalty is slightly different from that of timedemo runs, but close enough that any difference can be considered normal variation between runs. There are levels that have lower frame rates than The Nexus, but these levels are quite possibly system limited with our setup, so a faster processor than our Athlon 3500+ will be more useful in this case. For graphics cards, as with F.E.A.R, we recommend a slightly higher clocked GeForce 7900GT / 7800GTX or Radeon X1900XTX / X1900XT and equivalent to play at our test settings at 1024 x 768, 4x AA and 16x AF. Higher resolutions are only recommended with faster graphics card(s).
[Graphs: Serious Sam II at 1024 x 768, 1280 x 960 and 1600 x 1200, each showing the Greendale replay and Jump in the Fire in replay and gameplay]
| Test | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Jump in The Fire Gameplay | 96.03 | 95.22 | 94.73 | 0.84% | 1.35% |
| Jump in The Fire Replay | 90.42 | 89.59 | 88.97 | 0.91% | 1.60% |
| Greendale Replay | 59.27 | 58.94 | 57.42 | 0.56% | 3.13% |
| Average (entire game) | 96.87 | 94.66 | 88.42 | 1.14% | 7.33% |
| Average (top 5 graphical levels) | 95.21 | 89.99 | 77.57 | 4.15% | 18.38% |

| Test | 1024 x 768 AA AF | 1280 x 960 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Jump in The Fire Gameplay | 96.64 | 94.24 | 76.61 | 2.48% | 20.73% |
| Jump in The Fire Replay | 88.07 | 87.93 | 77.79 | 0.17% | 11.68% |
| Greendale Replay | 57.41 | 49.12 | 34.73 | 14.44% | 39.50% |
all numbers are in frame rates (except for percentages)
It may be hard to see from the graphs, but the results in the table show us that this game is system limited without AA and AF. Even with AA and AF, we only saw a change in average fps once we hit 1600 x 1200 with the Jump in the Fire level. As benchmarks go, the developer supplied Greendale replay is actually the more informative one, simply because it scales better. So, we really didn't learn anything with this game besides the fact that replays produce lower frame rates than what you will see in actual gameplay.
Conclusion:
So are timedemos and replays useless? After all, they are not able to provide results that are close to frame rates in actual gameplay. We do not think so. Unlike gameplay testing sessions, timedemo runs and replays are very repeatable and are very good at pointing out performance influencing factors and bottlenecks (graphics cards, processor, memory, disk writes / reads). Quake 4 really illustrates this point - most of the time you play this game, you're bottlenecked by the processor and not the graphics card with a setup similar to ours.
Also remember that gameplay testing sessions will likely have a higher variation between runs than a timedemo, simply because each run will be different in some way. Though you can minimize the effect this has on average frame rates, it would be very hard to compare per second frame rates. Since there's always the chance of higher normal variation between runs with gameplay testing sessions, we can draw the wrong conclusions from this method of performance testing. Our experience with Serious Sam II is an example of this. The differences between the results at 1024 x 768 and 1280 x 960 that we got from our initial runs were actually normal variations between runs, and it was only after we performed more rigorous testing that we saw the entire game is very likely system limited in our setup.
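One way to guard against this kind of wrong conclusion is to compare a measured change against the known run-to-run variation before reading anything into it. The sketch below is a simple rule of thumb under our own assumptions, not a statistical test: the name `is_within_noise` is hypothetical, and using 3.61 percent (the largest variation we measured on the five Serious Sam II levels) as the threshold for the whole game is an assumption.

```python
def is_within_noise(fps_a, fps_b, variation_pct):
    """True when the change from fps_a to fps_b is no larger than the noise.

    variation_pct is the run-to-run variation in percent; a change at or
    below it should not be treated as a real performance difference.
    """
    change_pct = abs(fps_a - fps_b) / fps_a * 100.0
    return change_pct <= variation_pct


WORST_CASE_VARIATION = 3.61  # percent, largest value in our repeatability table

# Serious Sam II, entire-game averages at 1024 x 768, 1280 x 960, 1600 x 1200.
fps_1024, fps_1280, fps_1600 = 96.87, 94.66, 88.42

print("1024 -> 1280 within noise:", is_within_noise(fps_1024, fps_1280, WORST_CASE_VARIATION))
print("1024 -> 1600 within noise:", is_within_noise(fps_1024, fps_1600, WORST_CASE_VARIATION))
```

With more runs per setting, a proper significance test would be a better tool than a fixed threshold, but with only three runs a conservative cutoff like this is about as much as the data supports.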
Running a benchmark with a timedemo / replay is also quicker and way easier than doing a gameplay testing session. By carefully doing research and picking a level that mostly stresses one or two components, we can get a good estimate of the average fps that's pretty close to what we can get with a real gameplay testing session. The added bonus is that we also get a pretty good idea of which level represents the frame rates we will see throughout the entire game.
Our experience doing gameplay testing sessions and timedemo runs / replays teaches us that no performance test tool is perfect. Each has its own advantages and disadvantages. Both should be used to try and 'capture' what kind of performance a piece of hardware can offer for a particular game. However, even using both doesn't mean we will be able to capture actual gameplay frame rates in their entirety. Remember, we made a subjective decision to play the game in a particular way that would be graphically intensive and produce repeatable results. We doubt real life gaming sessions from different gamers with different styles of play will be the same as ours. However, we're pretty sure that the results portray a worst case scenario - all things being equal, you should see slightly higher frame rates in these games with a similar setup.
Another thing to note is that these results are taken from the singleplayer levels of these games. Multiplayer levels are usually 'lighter' in graphical intensity, particularly since developers have no real control (no scripted events) over each room / corridor in a multiplayer level and stick to a polygon / texture / shader budget. There are also mods to consider - some mods, like the Excessive mod for Quake fans, will likely put more strain on graphics cards than the default mode of play.
Then there is the trend of more and more games becoming 'open ended' and using free roam environments. With these new games (The Elder Scrolls: Oblivion, Need for Speed: Most Wanted, Grand Theft Auto etc), it will be difficult if not impossible to represent gameplay with gameplay testing sessions, or even to try to cover every type of environment that's available in the game. In these cases, reviewers usually aim for a worst case scenario and conduct gameplay testing sessions in a scene / environment that's more graphically intensive than others. Of course, that brings us back to the first problem - results of these types of gameplay testing sessions will not reflect the frame rates you will encounter in the game.