Tech-Hounds.com

Because gamers play games, not benchmarks




Benchmarking on Trial - Part 1

Performance evaluations are never easy, and the argument over synthetic versus real world benchmarks is a never-ending one among reviewers, the general public and even manufacturers. However, it is important to realize that there is a difference between benchmarks and reviews - benchmarks are usually one part of a full review, the part that focuses on performance, since performance is relatively easier to measure than other features of a product (image / sound quality, ease of use and installation etc).

Since most people do put performance first when buying hardware, it's only natural for reviews to put more focus on performance than, say, value added features. Benchmarks are usually done with a fixed setup where only one piece of hardware is changed - the hardware that's being benchmarked. As a reference point, a similar product, either in performance, capabilities or price, is also benchmarked. Afterwards, the results of the benchmarks are compared to see which one offers the higher performance, according to the benchmark's metrics.

Once we have a performance comparison, the reviewer will usually add some other objective and subjective evaluations, ranging from their experience with the product to notable differences in features and output quality (usually the visual quality of the images, for graphics cards). Noise, heat levels and power consumption are also becoming more important than ever. Price may also be an influencing factor - let's face it, if two products being compared basically offer the same performance, we would go with either the one with more features or the one with the lower price, all other things (warranty / support) being equal. A slightly more difficult comparison is when the two products differ in both price and performance. Usually in these cases a price / performance ratio (sometimes with features factored in) is used, but these methods don't take into account the concerns of buyers on a strict budget - those who will only spend a certain amount of money on a product and no more.

Back to the subject - benchmarks. Since we want to evaluate a piece of hardware - a specific aspect of a system - it's typical for reviewers to use benchmarks that scale with that particular hardware, be it a processor, memory, graphics card or other component. Synthetic benchmarks are the ideal testing tool in this regard - they completely isolate a particular aspect of a system. However, keep in mind that performance with real applications in real life scenarios is often bottlenecked by several aspects. For instance, a graphics card can only render as fast as the processor can supply data to it, and the processor can only process data if the data is ready in its memory. An overly simplistic example, but it's not far from the truth. Just look at the many components involved in that statement - graphics card, processor and memory. There's also the processor's front side bus, the memory controller and even the graphics card's interface. That's why reviewers in general (and also the general public) favor benchmark results from real world applications or games that they use / play. Those results have more bearing on what kind of performance they can expect by adding a particular piece of hardware to their system.

But even the use of real world applications and games as benchmarks has come under fire in recent years. Since benchmarks need to produce repeatable / reproducible results, they often make use of a fixed series of actions or scenarios. While makers of benchmarks typically state that these scenarios are selected to be as close as possible to real world day to day use, they will not cover all types of use. Benchmarks using real world applications or games are not immune to this. This is even more true for graphical benchmarks with games: remember that game developers do not write their game code solely for graphics. They must develop their game so that the load is at least reasonably balanced between the processor and the graphics card. In this regard, a graphically oriented benchmark - usually a replay or a timedemo that focuses solely on graphics performance - will not offer results that portray the typical frame rate for the entire game. There will be times in the game where you will have graphically intensive scenes and less intensive moments. Naturally, frame rates between those two kinds of scenes will be different: less intensive scenes will have higher frame rates and more intensive scenes will have lower frame rates. There will also be scenes where you are more bottlenecked by the processor or memory subsystem than by the graphics card. In these scenes the graphics card won't be able to reach its peak potential and results are lower than they could be.

So, in that regard, graphical benchmarks using timedemos or replays are similar to synthetic benchmarks - they isolate the load on a specific piece of hardware, in this case the graphics card, even though they are real world applications or games. As a natural progression, reviewers and manufacturers are now using more 'real world' performance testing, with games in particular. The idea is to monitor frame rate during actual gameplay rather than replays or timedemos, so that the results are as close as can be to the frame rates users and gamers can expect in actual gameplay. To keep repeatability and reproducibility, each gameplay session is played in a specific manner so the outcome is similar each time. Results from repeated runs are then averaged as a proxy for that scenario, since the variations between runs are higher than what can be expected from timedemos.

Today we're going to share our experiences with both of these benchmark tools - gameplay testing sessions and replay / timedemo runs. This article is not meant as an argument for or against either of them, but rather as an 'adventure' in exploring the peculiarities of both tools.

Problems

Of the three games we're going to test today (F.E.A.R, Quake 4 and Serious Sam II), we have relied pretty much solely on gameplay testing sessions for Quake 4. While this method of testing does have higher system overhead than a timedemo, the graphics rendering is more accurate. Quake 4's timedemo is 'broken', meaning some shadows are not displayed, some textures are missing and the wrong fonts are used on OSD elements like the ones you see on elevator / lift control panels and switches. Here are some examples:

 

left: a misplaced OSD texture
right: notice there are shadows on the gameplay screenshot and none on the timedemo screenshot

Lately we have found that the results we're getting in Quake 4 don't seem to really represent the frame rates we've experienced throughout the entire game. We did choose the Data Processing Terminal level (the 2nd level with the tower guardian) because it scales pretty well with changes in resolution and graphical features, though. The same can also be said for Serious Sam II. With powerful graphics cards such as the GeForce 7900 and Radeon X1900 series, we have to enable AA and AF to make the Greendale demo scale with changes in resolution. We hope that by finding a level in Serious Sam II that scales better than the developer supplied Greendale demo, we can use that new level for additional benchmarks in our performance evaluation.

Another problem in doing gameplay testing sessions is a boon for gamers, but a nightmare for reviewers - checkpoint saves. All three games have checkpoint saves in most of the levels. While they are helpful in playing the game, they often cause a stutter / a sharp drop in frame rate because the game needs to write data to the hard drive. Because of this, we will rely only on average frame rate for most of this article. Hopefully, we can find levels that don't have any checkpoints, or where any checkpoints can be circumvented in some way.

The test setup and settings for the game are the same as our usual standard setup and settings, an Athlon 3500+ and a GeForce 7900GT running at 520 / 700 MHz (non reference core and memory clocks). Our focus is on graphical benchmarks, so we are trying to find levels that scale graphically. Here are the steps we've taken to narrow the choices down.
OK, let's get on with the show.

F.E.A.R

To get an idea of what frame rates we could expect from this game, we decided to play the game from start to finish three times, each at a different resolution. Please note that these are raw numbers and they will likely be different with each run. We're not focusing on repeatability or reproducible results at this moment; instead we're trying to get a general idea of which of F.E.A.R's levels are very responsive to resolution changes.

The number of pixels at 1280 x 960 and 1600 x 1200 is roughly 1.56 and 2.44 times the number of pixels at 1024 x 768, respectively. So, in a 100 percent graphically limited situation, where frame rate falls in inverse proportion to the number of pixels rendered, we would expect the performance hit to be around 36 and 59 percent, respectively. Of course, that's a rough estimate, and we will never be fully 100 percent graphically limited in a game. Now, let's look at those numbers.
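The arithmetic behind this estimate can be sketched in a few lines. This is a minimal sketch, assuming the idealized case where frame rate is exactly inversely proportional to the number of pixels rendered; the function names are our own, for illustration only. Under that assumption the result is a ceiling, and measured hits should land below it.

```python
# Baseline and target resolutions used in the article.
BASE = (1024, 768)
TARGETS = [(1280, 960), (1600, 1200)]

def pixel_ratio(res, base=BASE):
    """How many times more pixels `res` has than `base`."""
    return (res[0] * res[1]) / (base[0] * base[1])

def expected_hit(res, base=BASE):
    """Fractional fps drop if fps were inversely proportional to pixel count.

    This is an idealized, fully graphics-limited model (an assumption),
    so it acts as an upper bound rather than a prediction.
    """
    return 1.0 - 1.0 / pixel_ratio(res, base)

for res in TARGETS:
    print(f"{res[0]} x {res[1]}: {pixel_ratio(res):.2f}x pixels, "
          f"upper-bound hit {expected_hit(res):.0%}")
```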


| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance hit at 1280 | Performance hit at 1600 |
|---|---|---|---|---|---|
| Point of origin | 124.27 | 106.23 | 86.55 | 14.51% | 30.35% |
| First encounter | 120.01 | 100.3 | 71.65 | 16.42% | 40.30% |
| Infiltration | 129.53 | 108.44 | 81.77 | 16.29% | 36.87% |
| Heavy resistance | 116.22 | 94.96 | 71.96 | 18.30% | 38.09% |
| Bad water | 129.24 | 101.69 | 78.05 | 21.32% | 39.61% |
| Exeunt omnes | 117.71 | 102.06 | 78.84 | 13.29% | 33.02% |
| LZ is hot | 119.57 | 101.69 | 75.06 | 14.95% | 37.22% |
| Watchers | 121.74 | 100.11 | 73.4 | 17.77% | 39.71% |
| Bishop | 116.2 | 97.52 | 70.98 | 16.07% | 38.91% |
| Blindside | 129.08 | 105.44 | 81.61 | 18.31% | 36.78% |
| Sayonara, sucker | 139.8 | 110.58 | 84.02 | 20.90% | 39.90% |
| Unauthorized personnel | 120.16 | 100.25 | 73.31 | 16.56% | 38.99% |
| Afterimage | 121.27 | 89.63 | 72.01 | 26.09% | 40.62% |
| Alice Wade | 110.03 | 93.04 | 68.32 | 15.43% | 37.91% |
| Flight | 114.61 | 92.93 | 73.38 | 18.92% | 35.97% |
| Urban decay | 106.91 | 92.64 | 64.61 | 13.35% | 39.57% |
| Point of entry | 97.09 | 82.65 | 63.48 | 14.87% | 34.62% |
| Lapdog | 119.29 | 97.77 | 70.84 | 18.04% | 40.62% |
| Bypass | 143.21 | 114.22 | 83.01 | 20.24% | 42.04% |
| The Vault | 122.46 | 102.13 | 66.34 | 16.60% | 45.83% |
| Ground Zero | 125.41 | 94.46 | 68.44 | 24.68% | 45.43% |
| Aftermath | 106.19 | 87.18 | 64.89 | 17.90% | 38.89% |
| Average (entire game) | 120.45 | 98.91 | 73.75 | 17.76% | 38.69% |

all numbers are in frame rates (except for percentages)

F.E.A.R reacts nicely to resolution changes, which means you will most likely be seeing lots of graphically limited scenes throughout the entire game. Although we could prove this by benchmarking every level of the game repeatedly, that would take too much time and effort. To simplify things a bit, we took the top five levels with the largest performance hit at 1600 x 1200 and 1280 x 960 and put them in the table below. Undoubtedly, you'll notice some other levels such as Ground Zero and The Vault have higher performance hits than the ones we picked, but we think these levels don't really represent what we expect gameplay to be throughout the entire game.
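For readers who want to repeat the selection on their own numbers, here is a minimal sketch. It assumes the levels are ranked by their performance hit at 1600 x 1200 (our reading of the selection, which happens to reproduce the same five levels), and it uses only a subset of the rows from the full table. The names `RUNS`, `EXCLUDED`, `hit` and `top_levels` are hypothetical.

```python
# (level, fps at 1024 x 768, fps at 1280 x 960, fps at 1600 x 1200)
# A subset of the full-game table, enough to illustrate the ranking.
RUNS = [
    ("Point of origin", 124.27, 106.23, 86.55),
    ("First encounter", 120.01, 100.30, 71.65),
    ("Sayonara, sucker", 139.80, 110.58, 84.02),
    ("Afterimage", 121.27, 89.63, 72.01),
    ("Lapdog", 119.29, 97.77, 70.84),
    ("Bypass", 143.21, 114.22, 83.01),
    ("The Vault", 122.46, 102.13, 66.34),
    ("Ground Zero", 125.41, 94.46, 68.44),
]

# Levels set aside in the text as not representative of typical gameplay.
EXCLUDED = {"The Vault", "Ground Zero"}

def hit(base_fps, fps):
    """Performance hit in percent, relative to the 1024 x 768 frame rate."""
    return (base_fps - fps) / base_fps * 100.0

def top_levels(runs, n=5, excluded=EXCLUDED):
    """Return the n levels with the largest hit at 1600 x 1200."""
    candidates = [r for r in runs if r[0] not in excluded]
    ranked = sorted(candidates, key=lambda r: hit(r[1], r[3]), reverse=True)
    return [(name, round(hit(b, mid), 2), round(hit(b, high), 2))
            for name, b, mid, high in ranked[:n]]

for name, hit_1280, hit_1600 in top_levels(RUNS):
    print(f"{name}: {hit_1280}% at 1280 x 960, {hit_1600}% at 1600 x 1200")
```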


| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance hit at 1280 | Performance hit at 1600 |
|---|---|---|---|---|---|
| Bypass | 143.21 | 114.22 | 83.01 | 20.24% | 42.04% |
| Afterimage | 121.27 | 89.63 | 72.01 | 26.09% | 40.62% |
| Lapdog | 119.29 | 97.77 | 70.84 | 18.04% | 40.62% |
| First encounter | 120.01 | 100.3 | 71.65 | 16.42% | 40.30% |
| Sayonara, sucker | 139.8 | 110.58 | 84.02 | 20.90% | 39.90% |
| Average | 128.72 | 102.5 | 76.3 | 20.34% | 40.69% |

all numbers are in frame rates (except for percentages)

Notice how the average fps for the entire game is not that far off from the average fps of the top five levels in the table above. So, rest assured - it's very likely the frame rates we're going to see in these levels will be fairly close to what you would expect the entire time you play F.E.A.R.

Now that we've picked levels that are more graphically intensive than others (or at the very least most responsive to higher resolutions), let's see what frame rates we can expect from different runs. Obviously, we want as little variation between runs as possible. In order to do that, we play through the entire level in a controlled way, trying to get the same run each time. We also tried to keep the firefight scenes as graphically intensive as possible - that way we end up with a 'worst case' scenario. However, do keep in mind F.E.A.R is a very fluid game, so every run will be different in some way.

To confirm repeatability between runs, we ran through the top 5 levels we've picked at 1024 x 768. Any higher and the graphics subsystem will likely come into play. At 1024 x 768 without AA and AF, it's very likely we'll be more system limited and, from our experience, normal variations between runs are more often caused by system limitations than by the graphics card.

These results are taken at 1024 x 768, without AA and AF.


| Level | 1st run | 2nd run | 3rd run | Average | Variation | Variation (%) |
|---|---|---|---|---|---|---|
| Bypass | 146.26 | 145 | 148.86 | 146.71 | 1.97 | 1.34% |
| Afterimage | 104.29 | 103.29 | 104.13 | 103.9 | 0.54 | 0.52% |
| Lapdog | 112.14 | 115.73 | 117.44 | 115.1 | 2.71 | 2.35% |
| First encounter | 108.16 | 106.46 | 105.91 | 106.84 | 1.17 | 1.10% |
| Sayonara, sucker | 128.54 | 126.72 | 127.44 | 127.57 | 0.91 | 0.72% |

all numbers are in frame rates (except for percentages)

Wow. Those variations are pretty small - all of them are under 5 percent. In all honesty, we could use any of these levels as a proxy for the whole game. Of course, the less variation the better, and in that regard the Afterimage and Sayonara, sucker levels really stand out.
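The variation figures in the table are consistent, to within rounding, with the sample standard deviation of the three runs, with the percentage being that deviation divided by the average. That is an inference from the numbers rather than something stated outright, so treat the sketch below as one plausible way to reproduce them; `RUNS` and `summarize` are hypothetical names.

```python
from statistics import mean, stdev

# Average fps from three gameplay runs per level at 1024 x 768 (from the table).
RUNS = {
    "Bypass": [146.26, 145.00, 148.86],
    "Afterimage": [104.29, 103.29, 104.13],
    "Lapdog": [112.14, 115.73, 117.44],
    "First encounter": [108.16, 106.46, 105.91],
    "Sayonara, sucker": [128.54, 126.72, 127.44],
}

def summarize(samples):
    """Return (average, variation, variation in percent) for a list of runs.

    Variation is taken to be the sample standard deviation (n - 1 in the
    denominator), which is an assumption about how the table was computed.
    """
    avg = mean(samples)
    sd = stdev(samples)
    return avg, sd, sd / avg * 100.0

for level, samples in RUNS.items():
    avg, sd, pct = summarize(samples)
    print(f"{level}: average {avg:.2f} fps, variation {sd:.2f} fps ({pct:.2f}%)")
```

With only three runs per level the deviation itself is a rough estimate, so small differences between levels should not be over-interpreted.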

One thing to note though - if you look at the frame rates from the previous table, you'll see differences between the values there and in this table, particularly for Afterimage and First Encounter. Now, don't worry. We expected this, since initially we were more focused on playing the game than on trying to get similar results / outcomes. And we did say we tried to stress the card as much as possible with more graphically intensive firefights (explosions, smoke, particle effects). Later on, we will see just what kind of frame rates we can expect with resolution changes. For now, let's move on.

Next comes the million dollar question: does the F.E.A.R Performance Test really represent gameplay? Are its frame rates similar to what you will see while you're playing the game? We took the average of three runs of each level from the previous table and compared them to what we got from the F.E.A.R Performance Test. Since we're using FRAPS in gameplay testing sessions, we also threw in FRAPS results of the F.E.A.R Performance Test, just to see how much of a difference there is between them, if any.

These results are taken at 1024 x 768, without AA and AF.


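| Level | Gameplay | Performance Test (FRAPS) | Performance Test (internal) | Difference | Difference (%) |
|---|---|---|---|---|---|
| Bypass | 146.71 | 103.18 | 102.67 | -43.79 | -29.85% |
| Afterimage | 103.9 | 103.18 | 102.67 | -0.98 | -0.95% |
| Lapdog | 115.1 | 103.18 | 102.67 | -12.18 | -10.58% |
| First encounter | 106.84 | 103.18 | 102.67 | -3.92 | -3.67% |
| Sayonara, sucker | 127.57 | 103.18 | 102.67 | -24.65 | -19.32% |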

all numbers are in frame rates (except for percentages)

We can see that in some levels, average frame rates in gameplay testing sessions can be higher than in the Performance Test, even in levels that we think are graphically intensive. If you look at Afterimage and First Encounter, the frame rates are practically the same as in the F.E.A.R Performance Test. The other levels are much higher. If we have to hazard a guess, we think that's because these two levels have more firefights than others, making them more graphically intensive than the rest of the game - a 'worst case' scenario where frame rates will be the lowest compared to the other parts of the game.
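The difference figures in the table appear to be the mean of the two Performance Test readings (FRAPS and internal) minus the gameplay average, expressed as a percentage of the gameplay average. That reading is inferred from the numbers, so the sketch below is an assumption about the method; `benchmark_gap` is a hypothetical helper.

```python
# F.E.A.R Performance Test results at 1024 x 768 (fps), from the table.
PERF_TEST_FRAPS = 103.18
PERF_TEST_INTERNAL = 102.67

# Average gameplay fps per level, from the repeatability table.
GAMEPLAY = {
    "Bypass": 146.71,
    "Afterimage": 103.90,
    "Lapdog": 115.10,
    "First encounter": 106.84,
    "Sayonara, sucker": 127.57,
}

def benchmark_gap(gameplay_fps, fraps_fps, internal_fps):
    """Return (difference in fps, difference in percent of gameplay fps).

    The benchmark figure is taken as the mean of the FRAPS and internal
    readings; a negative result means the benchmark ran slower than gameplay.
    """
    benchmark = (fraps_fps + internal_fps) / 2.0
    diff = benchmark - gameplay_fps
    return diff, diff / gameplay_fps * 100.0

for level, fps in GAMEPLAY.items():
    diff, pct = benchmark_gap(fps, PERF_TEST_FRAPS, PERF_TEST_INTERNAL)
    print(f"{level}: {diff:+.2f} fps ({pct:+.2f}%)")
```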

From these five levels, it looks like we will be using Afterimage for our gameplay testing. Frame rates in Sayonara, sucker are much too different (much higher - too much walking around in empty rooms, corridors etc) and First Encounter's results have higher variations between runs than Afterimage's (though still only 1.1 percent). We did see lower frame rates in the Point of Entry level, but its performance hit at 1600 x 1200 is lower than in these two levels, so we will stick with Afterimage for gameplay testing.

There are also subjective reasons why we chose Afterimage. For one, there is only one savegame checkpoint in this level, so it will be much easier to get minimum frame rates with FRAPS, without the interference of a save. Firefights are pretty close to each other - three times in both parts (with a more frantic second firefight in the second part) and that means less let up on the graphics. The entire level is also pretty short, something like 6 minutes, which is a time saver when doing performance evaluations.

Quake 4

Just like what we did with F.E.A.R, we played through the whole game three times at different resolutions, recording frame rates for each level. Unlike F.E.A.R, Quake 4 is a much more difficult game, and no, we don't mean the AI. We mean it's easier to get killed by falling off a ledge and into a void. Playing with the frame rate cap off isn't going to make it any easier. At 1024 x 768, the game runs like a speed demon - even the pistol will behave like a machine gun at times. Thankfully, we managed to finish testing just fine and here are the results.


| Level | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance hit at 1280 | Performance hit at 1600 |
|---|---|---|---|---|---|
| Air Defense Bunker | 91.76 | 93.36 | 82.82 | -1.75% | 9.74% |
| Air Defense Trenches | 74.11 | 74.83 | 81.56 | -0.98% | -10.06% |
| Hangar Perimeter | 89.5 | 86.01 | 72.4 | 3.89% | 19.11% |
| Interior Hangar | 98.48 | 92.06 | 82.81 | 6.53% | 15.92% |
| MCC Landing Site | 76.83 | 76.59 | 65.27 | 0.31% | 15.05% |
| Op: Advantage | 86.14 | 86.64 | 72.36 | -0.58% | 16.00% |
| Canyon | 76.22 | 73.93 | 75.28 | 3.01% | 1.23% |
| Perimeter Defense Station | 90.54 | 89.48 | 72.86 | 1.17% | 19.53% |
| Aqueducts | 74.71 | 74.88 | 77.35 | -0.23% | -3.53% |
| Aqueducts Annex | 83.23 | 81.01 | 76.52 | 2.67% | 8.07% |
| Nexus Hub Tunnels | 87.31 | 83.62 | 78.04 | 4.22% | 10.61% |
| Nexus Hub | 89.67 | 86.58 | 78.04 | 3.44% | 12.97% |
| Strogg Med Facilities | 97.54 | 92.69 | 90.54 | 4.97% | 7.17% |
| Construction Zone | 91.27 | 86.51 | 76.38 | 5.22% | 16.32% |
| Dispersal Facility | 98.69 | 95.5 | 93.21 | 3.23% | 5.55% |
| Recomposition Center | 117.38 | 117.56 | 107.52 | -0.16% | 8.39% |
| Putrification Center | 92.01 | 91.34 | 87.16 | 0.73% | 5.28% |
| Waste Processing Facility | 106.99 | 102.02 | 102.65 | 4.65% | 4.06% |
| Op: Last Hope | 94.71 | 93.38 | 83.51 | 1.41% | 11.82% |
| Data Storage Terminal | 66.8 | 67.76 | 69.45 | -1.44% | -3.98% |
| Data Storage Security | 94.88 | 93.39 | 82.13 | 1.57% | 13.44% |
| Data Storage Terminal | 76.85 | 75.39 | 77.84 | 1.90% | -1.28% |
| Tram Hub Station | 86.73 | 76.84 | 91.65 | 11.41% | -5.67% |
| Tram Rail | 73.16 | 70.38 | 68.6 | 3.80% | 6.23% |
| Data Processing Terminal | 88.17 | 71.02 | 85.28 | 19.45% | 3.27% |
| Data Processing Security | 85.65 | 85.28 | 80.47 | 0.43% | 6.05% |
| Data Processing Terminal | 119.79 | 119.57 | 104.22 | 0.19% | 13.00% |
| Data Networking Terminal | 79.83 | 79.9 | 73.22 | -0.09% | 8.27% |
| Data Networking Security | 89.38 | 92.11 | 82 | -3.05% | 8.26% |
| Nexus Core | 97.94 | 96.5 | 91.73 | 1.47% | 6.34% |
| The Nexus | 82.13 | 79.84 | 69.34 | 2.79% | 15.58% |
| Average (entire game) | 88.98 | 86.64 | 81.68 | 2.59% | 7.83% |

all numbers are in frame rates (except for percentages)

Surprisingly, Quake 4 is generally less responsive to resolution changes than F.E.A.R. Either that or we're system limited, and we're more inclined to think it's the latter. Average frame rate for the entire game is pretty similar whether you're playing at 1024 x 768, 1280 x 1024 or 1600 x 1200. That's why it's important to pick levels that are more graphically intensive than others if you want to use this as a graphics card benchmark - below you'll see the top five levels we've chosen.


| Level | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance hit at 1280 | Performance hit at 1600 |
|---|---|---|---|---|---|
| Perimeter Defense Station | 90.54 | 89.48 | 72.86 | 1.17% | 19.53% |
| Hangar Perimeter | 89.5 | 86.01 | 72.4 | 3.89% | 19.11% |
| Interior Hangar | 98.48 | 92.06 | 82.81 | 6.53% | 15.92% |
| The Nexus | 82.13 | 79.84 | 69.34 | 2.79% | 15.58% |
| MCC Landing Site | 76.83 | 76.59 | 65.27 | 0.31% | 15.05% |
| Average | 87.5 | 84.8 | 72.53 | 2.94% | 17.04% |

all numbers are in frame rates (except for percentages)

Ah, that's more like it. Do note that frame rates for the levels in the table above are generally lower than what you see throughout the entire game, though they're still fairly close. The performance hits at 1280 x 1024 range from practically nothing (0.31 percent) to 6.53 percent, while at 1600 x 1200 they are all pretty similar. The Interior Hangar level stands out as the one with the largest hit at 1280 x 1024, but The Nexus's performance hit is pretty close to the average. The Nexus also has the second lowest frame rate of the five, with the MCC Landing Site having the lowest. However, we still need to confirm these results with more testing - are they repeatable? Let's look at the next table.

These results are taken at 1024 x 768, without AA and AF.


| Level | 1st run | 2nd run | 3rd run | Average | Variation | Variation (%) |
|---|---|---|---|---|---|---|
| Perimeter Defense Station | 89.85 | 89.17 | 89.78 | 89.6 | 0.37 | 0.41% |
| Hangar Perimeter | 85.79 | 84.48 | 83.2 | 84.49 | 1.3 | 1.54% |
| Interior Hangar | 90.98 | 93.23 | 90.39 | 91.53 | 1.5 | 1.63% |
| The Nexus | 79.41 | 75.29 | 77.39 | 77.36 | 2.06 | 2.66% |
| MCC Landing Site | 79.11 | 72.29 | 70.92 | 74.11 | 4.39 | 5.92% |

all numbers are in frame rates (except for percentages)

Only the Perimeter Defense Station results are pretty close to our initial results. The other levels are roughly 3 to 7 frames per second lower. Remember, we wanted a worst case scenario that really pushes the graphics card. The good news is the results from these levels are repeatable - only the MCC Landing Site level has slightly larger variations between runs. The Nexus level remains the one with the second lowest average frame rate.

Alright, now we're going to see whether or not there are differences between frame rates in actual gameplay and in a timedemo run. Since we're measuring frame rate in gameplay testing sessions, we are playing the game with all the bells and whistles on - AI, sound, physics etc. Differences with timedemo runs are expected, but it sure would be good to see just a small difference between gameplay testing sessions and timedemo results. In the table below, you'll see the average of three runs compared to a timedemo run for each of the levels we've picked. Just to make sure FRAPS is working correctly, we've included results from both Quake 4's internal timedemo statistics and FRAPS.

These results are taken at 1024 x 768, without AA and AF.


| Level | Gameplay | Timedemo (FRAPS) | Timedemo (internal) | Difference | Difference (%) |
|---|---|---|---|---|---|
| Perimeter Defense Station | 89.6 | 104.31 | 104.2 | 14.66 | 16.36% |
| Hangar Perimeter | 84.49 | 91.37 | 91.5 | 6.95 | 8.22% |
| Interior Hangar | 91.53 | 88.36 | 88.45 | -3.13 | -3.42% |
| The Nexus | 77.36 | 77.72 | 77.5 | 0.25 | 0.32% |
| MCC Landing Site | 74.11 | 68.48 | 68 | -5.87 | -7.92% |

all numbers are in frame rates (except for percentages)

Well, that's... weird. We usually expect timedemo results to be higher than gameplay testing session results, given the absence of sound (and some texture errors). We sure didn't expect timedemo results to actually be lower in some levels. From the looks of it, the only level that stays fairly close is The Nexus. The results from the other levels are too different (or too strange) compared to a timedemo run. It's probably safe to say that you shouldn't rely only on timedemo results as a measure of actual gameplay performance with Quake 4. Timedemo results can actually be very different from frame rates in actual gameplay.

After more detailed testing and some additional research, this weird behavior with the Quake 4 timedemo turns out not to be weird at all. Timedemos don't run AI, sound or physics routines; they're straight graphical instructions for the graphics card to render the game's graphics. That means a timedemo is very close to a pure graphics stress test, though processor, memory and bus speeds / bandwidth still have some influence on the results. By taking the burden of AI, sound and physics off the processor, the processor can focus most of its processing power on supplying the graphics card, which in turn tries to render the game as fast as it can - hence the word 'timedemo'. In our case, higher timedemo results in certain levels indicate that we are system limited in those levels during gameplay. Conversely, lower results may indicate we are graphically limited (pay attention to the word 'may').
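That rule of thumb can be written down as a small classifier. This is only a sketch of the reasoning above: the 3 percent tolerance band is a hypothetical value we picked for illustration, not something measured, and the "graphics limited" label stays deliberately tentative. `timedemo_gap_pct` and `classify` are made-up names.

```python
# Gameplay average, timedemo via FRAPS, timedemo via internal statistics (fps),
# copied from the Quake 4 comparison table.
LEVELS = {
    "Perimeter Defense Station": (89.60, 104.31, 104.20),
    "Hangar Perimeter": (84.49, 91.37, 91.50),
    "Interior Hangar": (91.53, 88.36, 88.45),
    "The Nexus": (77.36, 77.72, 77.50),
    "MCC Landing Site": (74.11, 68.48, 68.00),
}

def timedemo_gap_pct(gameplay, fraps, internal):
    """Timedemo fps (mean of both readings) relative to gameplay fps, in percent."""
    timedemo = (fraps + internal) / 2.0
    return (timedemo - gameplay) / gameplay * 100.0

def classify(gap_pct, tolerance_pct=3.0):
    """Interpret the gap; tolerance_pct is an assumed noise band, not a measured one."""
    if gap_pct > tolerance_pct:
        return "likely system limited in gameplay"
    if gap_pct < -tolerance_pct:
        return "possibly graphics limited"
    return "roughly balanced"

for name, fps in LEVELS.items():
    gap = timedemo_gap_pct(*fps)
    print(f"{name}: {gap:+.2f}% -> {classify(gap)}")
```

A more careful version would set the tolerance per level from its run-to-run variation instead of using one fixed number.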

So, out of the five levels we've picked, the choice comes down to two - Interior Hangar and The Nexus. We've decided to choose The Nexus for gameplay testing sessions - the primary reason is that the results indicate this level has close to equal load on both the system and the graphics subsystem. There are other points in favor of The Nexus level - it lacks a checkpoint save, so there are no hard disk writes that would cause a stutter or a sharp drop in frame rate, which makes it ideal for getting minimum frame rates. After all, we believe a lower minimum fps is more of a concern than a lower average fps. Generally speaking, average fps from The Nexus is pretty close to the lowest frame rates we've seen throughout the entire game, making it quite a perfect fit for a worst case scenario.

Serious Sam II

This game is quite possibly the longest of the three - 41 levels in total. There are some pretty funny, albeit dry, jokes in it like, "These rocks can shake, but I won't Quake." or even "I love collecting big guns. They're so Unreal." Hmm, what's the matter Croteam? No Half-Life jokes? The last level, Mental Institution, was quite probably the longest level we have ever played in an FPS. Even the guys at Croteam put an estimated playtime of 1 hour for this level. Wonder what they will call the sequel: Serious Sam II, well, 2? I know - Serious Sam II: Double the Fun!!! Oh, well - here are the results of many long hours of playing Serious Sam II.


| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance hit at 1280 | Performance hit at 1600 |
|---|---|---|---|---|---|
| Jungle | 92.26 | 91.01 | 82.37 | 1.35% | 10.72% |
| Riverdance | 89.28 | 84.3 | 82.59 | 5.58% | 7.49% |
| M'Keke Village | 84.05 | 92.44 | 82.99 | -9.98% | 1.26% |
| Road to Ursul | 70.58 | 75.21 | 67.91 | -6.56% | 3.78% |
| Ursul Suburbs | 83.98 | 84.99 | 80.84 | -1.20% | 3.74% |
| Kukulele Prison | 80.78 | 81.31 | 72.89 | -0.66% | 9.76% |
| Ursul Gardens | 75.77 | 78.25 | 77.64 | -3.28% | -2.47% |
| Kwongo | 104.6 | 89.58 | 81.84 | 14.36% | 21.76% |
| Deadwood | 86.62 | 86.06 | 84.88 | 0.65% | 2.01% |
| Forsaken Compound | 74.92 | 75.81 | 81.69 | -1.19% | -9.04% |
| Branchester | 81.44 | 78.36 | 78.31 | 3.78% | 3.84% |
| Woodstock | 95.39 | 96.54 | 95.27 | -1.21% | 0.12% |
| Giant Junkyard | 56.41 | 60.46 | 62.45 | -7.18% | -10.69% |
| Zum-Zum | 108.23 | 95.88 | 128.07 | 11.41% | -18.34% |
| Hong-Pong | 83.67 | 79.72 | 68.5 | 4.72% | 18.13% |
| Wheels of Fortune | 65.19 | 64.18 | 62.74 | 1.55% | 3.75% |
| Chanolin Chambers | 105.79 | 108.7 | 97.54 | -2.76% | 7.79% |
| Chanolin Temple | 88.7 | 83.77 | 74.2 | 5.55% | 16.34% |
| Prince Chan | 99.5 | 90.52 | 91.94 | 9.02% | 7.60% |
| Unreal Wasteland | 113.9 | 115.69 | 113.48 | -1.57% | 0.37% |
| Doomed Canyon | 135.17 | 129.42 | 129.16 | 4.26% | 4.45% |
| Boneyard | 132.35 | 123.6 | 118.57 | 6.61% | 10.41% |
| Count Kleerofski | 159.96 | 98.21 | 80.21 | 38.60% | 49.86% |
| Greendale | 81.5 | 81.87 | 83.09 | -0.46% | -1.95% |
| Coast to Cost | 80.63 | 85 | 88.97 | -5.42% | -10.35% |
| Kingsburg | 90.83 | 91.75 | 91.98 | -1.01% | -1.26% |
| Royal Sewers | 141.35 | 140.95 | 138.95 | 0.28% | 1.70% |
| Castle of Rock | 108.3 | 109.12 | 97.33 | -0.75% | 10.13% |
| Floaterra | 86.17 | 95.8 | 92.63 | -11.18% | -7.50% |
| Cecil the Dragon | 170.89 | 179.86 | 162.56 | -5.25% | 4.87% |
| Area 5100 | 96.6 | 101.97 | 97.39 | -5.56% | -0.82% |
| Command Center | 110.49 | 95.93 | 100.66 | 13.18% | 8.90% |
| Freezepad | 104.03 | 106.38 | 9.54 | -2.26% | 90.83% |
| Frostpost | 78.14 | 83.63 | 80.9 | -7.03% | -3.54% |
| Hugo the Robot | 124.22 | 119.96 | 109.33 | 3.43% | 11.98% |
| Siriuspolis Uptown | 74.86 | 83.21 | 60.67 | -11.15% | 18.95% |
| Shield Generator | 59.74 | 60.39 | 59.52 | -1.08% | 0.38% |
| Be Quick or Be Dead | 97.31 | 93.71 | 97.14 | 3.70% | 0.17% |
| Welcome to the Jungle | 87.51 | 95.28 | 83.55 | -8.88% | 4.53% |
| Jump in the Fire | 125.03 | 106.64 | 98.95 | 14.71% | 20.86% |
| Siriuspolis Downtown | 78.64 | 83.57 | 78.87 | -6.27% | -0.29% |
| Mental Institution | 103.81 | 96.63 | 85.52 | 6.91% | 17.61% |
| Average (entire game) | 96.87 | 94.66 | 88.42 | 1.14% | 7.33% |

all numbers are in frame rates (except for percentages)

The entire game seems to be quite system limited. It would've been a much different picture with a faster processor or a slower graphics card, but we didn't have either of them on hand so these will have to do. The average performance hit for the entire game at 1600 x 1200 is just around 7 percent - that's just how system limited this game is in our test setup. Obviously, we have to pick the more graphically intensive levels, where the performance hit is substantially larger than that. We didn't pick the Boss levels (Kwongo, Count Kleerofski) since they're pretty small and not really your typical Serious Sam II level. You can see the top 5 levels below.


| Level | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance hit 1024 to 1280 | Performance hit 1024 to 1600 |
|---|---|---|---|---|---|
| Jump in the Fire | 125.03 | 106.64 | 98.95 | 14.71% | 20.86% |
| Siriuspolis Uptown | 74.86 | 83.21 | 60.67 | -11.15% | 18.95% |
| Hong-Pong | 83.67 | 79.72 | 68.5 | 4.72% | 18.13% |
| Mental Institution | 103.81 | 96.63 | 85.52 | 6.91% | 17.61% |
| Chanolin Temple | 88.7 | 83.77 | 74.2 | 5.55% | 16.34% |
| Average | 95.21 | 89.99 | 77.57 | 4.15% | 18.38% |

all numbers are in frame rates (except for percentages)

Average frame rates do seem to differ from the entire game's average, by anywhere from 6 to 30 fps - a very big difference. The performance hit seems to indicate these levels are more graphically intensive than others, but would it be enough? Out of these levels, only Mental Institution has an average fps that's close to what we've seen throughout the entire game. Personally, we would rather not use this level - it's just too damn long. From a performance penalty perspective, the Hong-Pong level is fairly close to the average performance penalty, followed by Chanolin Temple. Now, let's see if these numbers are repeatable.

These results are taken at 1024 x 768, without AA and AF.


| Level | 1st run | 2nd run | 3rd run | Average | Variation | Variation (%) |
|---|---|---|---|---|---|---|
| Jump in the Fire | 90.75 | 92.95 | 91.86 | 91.86 | 1.1 | 1.20% |
| Siriuspolis Uptown | 71.21 | 67.78 | 66.42 | 68.47 | 2.47 | 3.61% |
| Hong-Pong | 73.6 | 74.47 | 73.04 | 73.7 | 0.72 | 0.98% |
| Mental Institution | 100.68 | 99.47 | 100.43 | 100.2 | 0.64 | 0.64% |
| Chanolin Temple | 87.88 | 85.47 | 82.79 | 85.38 | 2.55 | 2.98% |

all numbers are in frame rates (except for percentages)

We tried to play these levels in a similar manner each time and by doing so, we've managed to keep variations pretty low. However, we can also see that actual frame rates are much lower than in our initial run. Only Mental Institution's and Chanolin Temple's results are fairly close to our initial results. However, Chanolin Temple seems to have a slightly higher variation between runs than Hong-Pong and Mental Institution. For our sanity's sake, let's hope it will be either Chanolin Temple or Hong-Pong, preferably Hong-Pong since it's much shorter and reacts much like other levels to resolution changes.

OK. Time to see if we can rely on a replay instead of playing the game by hand. Will frame rates from a replay be indicative of actual frame rates in gameplay? Serious Sam II doesn't have a timedemo, but it does have a replay that can be used as a benchmark tool with the proper console command. We recorded a replay for each level, then played that replay back, measuring performance with both Serious Sam II's internal benchmarking tool and FRAPS. We did this just to make sure FRAPS is recording and reporting accurately. Here are the results for each level.

These results are taken at 1024 x 768, without AA and AF.


| Level | Gameplay | Replay (FRAPS) | Replay (internal) | Difference | Difference (%) |
|---|---|---|---|---|---|
| Jump in the Fire | 91.86 | 86.84 | 87.8 | -4.53 | -4.94% |
| Siriuspolis Uptown | 68.47 | 56.27 | 56.75 | -11.96 | -17.47% |
| Hong-Pong | 73.7 | 68.58 | 69.55 | -4.64 | -6.29% |
| Mental Institution | 100.2 | 90.36 | 91.1 | -9.47 | -9.45% |
| Chanolin Temple | 85.38 | 72.54 | 73.45 | -12.38 | -14.50% |

all numbers are in frame rates (except for percentages)

Ouch. Another strange behavior, but thankfully more predictable this time. Looks like replay results are lower than gameplay, making them poorly representative of frame rates in actual gameplay. The difference between replays and gameplay testing sessions can even be significant, so it's definitely not just normal variations between runs. Remember, variations between runs are fairly small with these levels (0.64 - 3.61 percent). In all honesty, our options are now limited to two levels - Jump in the Fire and Hong-Pong - the two levels that behave most similarly in replays and gameplay testing sessions. Out of the two, we pick Jump in the Fire because it's shorter.
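One way to make the "not just normal variation" argument concrete is to compare each level's replay gap against its run-to-run variation. The sketch below does that with the percentages from the two tables above; the factor of 2 is an arbitrary, hypothetical margin, and `exceeds_noise` is a made-up helper, not a formal significance test.

```python
# (replay vs. gameplay difference in %, run-to-run variation in %),
# both taken from the Serious Sam II tables.
LEVELS = {
    "Jump in the Fire": (-4.94, 1.20),
    "Siriuspolis Uptown": (-17.47, 3.61),
    "Hong-Pong": (-6.29, 0.98),
    "Mental Institution": (-9.45, 0.64),
    "Chanolin Temple": (-14.50, 2.98),
}

def exceeds_noise(gap_pct, variation_pct, factor=2.0):
    """True if the gap is larger than `factor` times the run-to-run variation.

    `factor` is an assumed safety margin chosen for illustration.
    """
    return abs(gap_pct) > factor * variation_pct

for name, (gap, variation) in LEVELS.items():
    verdict = "beyond run-to-run noise" if exceeds_noise(gap, variation) else "within noise"
    print(f"{name}: gap {gap:+.2f}%, variation {variation:.2f}% -> {verdict}")
```

With these numbers every level's gap is larger than twice its variation, which is in line with the reading that the replay overhead is real.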

We've seen the results from three newer games in our benchmark suite and it's not pretty. Out of the three, only the F.E.A.R Performance Test offers results that are close enough to frame rates in actual gameplay. The Quake 4 timedemo is not only broken (misplaced textures, fonts), but also behaves rather erratically in some levels. The Serious Sam II replay incurs quite a significant overhead in some cases, offering results that are actually lower than frame rates in actual gameplay.

It's beginning to look like timedemos and replays cannot really represent what kind of frame rate you can expect from a game, even with a developer supplied demo / replay. However, that doesn't mean they're completely useless. They can still give us some idea of how a game will behave in graphically limited situations, just not the exact level of frame rates in actual gameplay. For more representative test results, actual gameplay testing sessions are a better tool. Now that we have our candidates for gameplay testing, let's see how they stand up in a real performance test.

Performance

AA and AF settings are applied in game whenever possible. In Quake 4, we enable both AA and AF from the console, with the command 'r_multisample' set to 4 and 'image_anisotropy' set to 16. For gameplay testing, we removed the frame rate cap by entering the command 'com_fixedtic 1'. V sync was disabled both from inside the game and in the driver panel. AA and AF are enabled from the settings menu for F.E.A.R and Serious Sam II.

F.E.A.R settings are set to our usual test settings (all settings set to maximum / on except for soft shadows) and volumetric lights were turned on. Quake 4 was set to 'High Quality', while Serious Sam II was tested with our usual standard settings (maximum settings but with custom resolution, anti aliasing and anisotropic filtering settings).

The graphs are pretty self explanatory, but in case you can't see the text, they're arranged in groups of three colors: green for minimum fps, blue for average fps and red for maximum fps. The first group represents test results under default settings, and the second represents results with 4xAA and 16xAF enabled. These results are in frames per second.
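
For readers who want to reproduce the three figures in each group, here is a minimal Python sketch. It assumes a log with one frames-per-second sample for each second of the run, such as the per-second log FRAPS can write. The function name `summarize_fps` is illustrative, and the sample values are made up purely to show the arithmetic; they are not taken from the test runs.

```python
def summarize_fps(samples):
    """Return (minimum, average, maximum) fps for a list of per-second samples."""
    if not samples:
        raise ValueError("need at least one fps sample")
    return min(samples), sum(samples) / len(samples), max(samples)


if __name__ == "__main__":
    # Hypothetical per-second fps samples, for illustration only.
    samples = [62, 75, 58, 90, 110, 85]
    low, avg, high = summarize_fps(samples)
    print(f"min {low} fps, avg {avg:.3f} fps, max {high} fps")
```

Because every sample covers exactly one second, the mean of the samples equals total frames divided by total seconds.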

F.E.A.R, 1024 x 768

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Afterimage 1 | 44 | 109.417 | 329 | 30 | 70.906 | 291 |
| Afterimage 2 | 38 | 100.802 | 259 | 30 | 72.36 | 209 |
| FEAR Performance Test | 40 | 106.719 | 276 | 38 | 72.411 | 176 |

F.E.A.R, 1280 x 960

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Afterimage 1 | 43 | 89.428 | 313 | 25 | 53.921 | 211 |
| Afterimage 2 | 38 | 88.491 | 211 | 23 | 53.583 | 162 |
| FEAR Performance Test | 39 | 85.499 | 195 | 30 | 53.349 | 122 |

F.E.A.R, 1600 x 1200

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Afterimage 1 | 26 | 64.178 | 239 | 16 | 36.509 | 137 |
| Afterimage 2 | 30 | 64.048 | 179 | 14 | 37.001 | 117 |
| FEAR Performance Test | 37 | 64.263 | 132 | 20 | 37.216 | 83 |


| Test | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Afterimage | 103.9 | 87.13 | 63.57 | 16.14% | 38.82% |
| FEAR Performance Test | 106.72 | 85.5 | 64.26 | 19.88% | 39.78% |
| Average (entire game) | 120.45 | 98.91 | 73.75 | 17.76% | 38.69% |
| Average (top 5 graphical levels) | 128.72 | 102.5 | 76.3 | 20.34% | 40.69% |

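| Test | 1024 x 768 AA AF | 1280 x 960 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Afterimage | 73.23 | 52.61 | 36.67 | 28.16% | 49.93% |
| FEAR Performance Test | 72.41 | 53.35 | 37.22 | 26.32% | 48.60% |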

all numbers are in frame rates (except for percentages)
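
The Performance Hit columns express the drop in average fps relative to the 1024 x 768 result. Here is a minimal Python sketch of that arithmetic, assuming that definition; the function name `performance_hit` is illustrative. With the rounded averages from the table it reproduces the published percentages to within rounding.

```python
def performance_hit(base_fps, fps):
    """Percentage of frame rate lost relative to the baseline resolution."""
    return (base_fps - fps) / base_fps * 100


if __name__ == "__main__":
    # Average fps without AA and AF at 1024 x 768, 1280 x 960 and 1600 x 1200.
    rows = {
        "Afterimage": (103.9, 87.13, 63.57),
        "FEAR Performance Test": (106.72, 85.5, 64.26),
    }
    for name, (base, mid, high) in rows.items():
        hit_1280 = performance_hit(base, mid)
        hit_1600 = performance_hit(base, high)
        print(f"{name}: {hit_1280:.2f}% at 1280, {hit_1600:.2f}% at 1600")
```

A negative result means the higher resolution was actually faster, which can happen when the difference is within normal variation between runs.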

By not including checkpoint saves in our gameplay testing runs, we were able to get a pretty repeatable minimum frame rate for the Afterimage level. Although the F.E.A.R Performance Test was pretty much on the mark with average fps, it's a slightly hit-and-miss affair on minimum fps. Take a close look at the 1024 x 768, AA / AF results. In the second part of Afterimage, the second firefight was graphically intensive enough to reach a minimum of 30 fps, while the F.E.A.R Performance Test 'only' dropped to 38 fps. That's quite a huge difference - roughly 25 percent. We also saw a similar difference at 1600 x 1200 without AA and AF. To be fair, the F.E.A.R Performance Test was pretty close at 1024 x 768 and 1280 x 960 without AA and AF, so it's obviously good enough for quick benchmarks, but not for more detailed information.

On the subject of representing the entire game, the F.E.A.R Performance Test does a pretty good job of showing a worst case scenario. Yes, frame rates are lower than what you will see throughout the entire game, but the performance penalty is similar. We didn't run the entire game with AA and AF, but we think the difference between gameplay testing sessions for the entire game and the F.E.A.R Performance Test should be minimal. There will be levels that have lower frame rates than Afterimage and the F.E.A.R Performance Test, so we think it would be wise to keep the minimum frame rate close to 40 fps, if possible. Looking back at our various reviews, we recommend a slightly higher clocked GeForce 7900GT / 7800GTX or Radeon X1900XTX / X1900XT and equivalent to play at our test settings at 1024 x 768, 4x AA and 16x AF. Higher resolutions are only recommended with faster graphics card(s).

Quake 4, 1024 x 768

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Data Processing Terminal, Gameplay | 66 | 119.244 | 169 | 66 | 98.445 | 146 |
| Data Processing Terminal, Timedemo | 62 | 111.895 | 162 | 62 | 106.205 | 132 |
| The Nexus, Gameplay | 49 | 86.195 | 323 | 39 | 63.901 | 211 |
| The Nexus, Timedemo | 31 | 80.613 | 272 | 30 | 70.031 | 166 |

Quake 4, 1280 x 1024

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Data Processing Terminal, Gameplay | 67 | 117.006 | 161 | 66 | 98.445 | 146 |
| Data Processing Terminal, Timedemo | 63 | 111.935 | 162 | 54 | 82.149 | 113 |
| The Nexus, Gameplay | 42 | 81.4 | 255 | 29 | 44.307 | 145 |
| The Nexus, Timedemo | 31 | 78.869 | 206 | 31 | 56.56 | 117 |

Quake 4, 1600 x 1200

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Data Processing Terminal, Gameplay | 61 | 106.021 | 158 | 32 | 50.61 | 89 |
| Data Processing Terminal, Timedemo | 60 | 105.048 | 146 | 37 | 61.439 | 92 |
| The Nexus, Gameplay | 41 | 70.439 | 197 | 20 | 32.837 | 103 |
| The Nexus, Timedemo | 23 | 67.273 | 158 | 22 | 42.651 | 89 |


| Test | 1024 x 768 | 1280 x 1024 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Data Processing Terminal Gameplay | 119.12 | 117.66 | 105.76 | 1.22% | 11.22% |
| Data Processing Terminal Timedemo | 109.96 | 111.43 | 107.89 | -1.34% | 1.89% |
| The Nexus Gameplay | 85.31 | 81.93 | 71.2 | 3.97% | 16.54% |
| The Nexus Timedemo | 80.65 | 78.98 | 67.11 | 2.07% | 16.78% |
| Average (entire game) | 88.98 | 86.64 | 81.68 | 2.59% | 7.83% |
| Average (top 5 graphical levels) | 87.5 | 84.8 | 72.53 | 2.94% | 17.33% |

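| Test | 1024 x 768 AA AF | 1280 x 1024 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Data Processing Terminal Gameplay | 98.08 | 69.01 | 50.68 | 29.64% | 48.32% |
| Data Processing Terminal Timedemo | 106.46 | 82.51 | 61.45 | 22.49% | 42.28% |
| The Nexus Gameplay | 64.13 | 44.02 | 32.46 | 31.37% | 49.39% |
| The Nexus Timedemo | 70.18 | 56.34 | 42.15 | 19.72% | 39.94% |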

all numbers are in frame rates (except for percentages)

Since these two levels do not have checkpoint saves, we were also able to get a pretty repeatable minimum frame rate for both Data Processing Terminal and The Nexus. We saw earlier how results from gameplay testing sessions and timedemo runs can be very different, and here that problem rears its ugly head again. In Data Processing Terminal at 1024 x 768 without AA / AF, gameplay testing sessions give us a higher minimum fps and average fps than timedemo runs, while with AA and AF they give us a higher minimum fps but a lower average fps. The same applies to The Nexus at 1024 x 768.

At 1280 x 1024 without AA and AF, gameplay testing session results are still generally higher than timedemo runs, but at 1600 x 1200 they're pretty much the same in Data Processing Terminal. With AA and AF, 1280 x 1024 and 1600 x 1200 behave differently - gameplay testing session results are higher at 1280 x 1024 and timedemo results are higher at 1600 x 1200. The Nexus is a different matter - timedemo results are slower without AA / AF and faster with AA / AF at both 1280 x 1024 and 1600 x 1200. We ran all these tests 3 times, and while there are very small variations between runs (1 to 3 fps), the variations pretty much disappear as the graphical load gets heavier.

These facts lead us to an interesting conclusion: by comparing results from timedemo runs and gameplay testing sessions, we were able to pinpoint at what resolution the game (or a particular level) becomes graphically limited instead of system limited. In Data Processing Terminal, the crossover point is 1600 x 1200 without AA and AF. With AA and AF enabled, the crossover point happens a lot sooner for both levels - below 1024 x 768. The Nexus still remains system limited at 1600 x 1200 without AA and AF, but with AA and AF it too becomes graphics limited by 1024 x 768. As a quick performance test, we think a gameplay testing session on Data Processing Terminal is still valid as a graphical benchmark because the performance hit is still very high, though we have to admit the hit is higher than in timedemo runs, which means system limitations do add to the hit here. However, the frame rates we got from this method are closer to what you may actually see when you're playing the game - Quake 4 has a frame rate cap of 60 fps by default.
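
The comparison described in this paragraph can be sketched in Python. The rule used here is one reading of the text, stated as an assumption: a level counts as graphically limited at the first tested resolution where the timedemo average is at least as high as the gameplay average. The averages are taken from the two Quake 4 summary tables; the names `RESOLUTIONS`, `AVERAGES` and `crossover` are illustrative.

```python
RESOLUTIONS = ("1024 x 768", "1280 x 1024", "1600 x 1200")

# Average fps per resolution as (gameplay, timedemo), keyed by (level, AA/AF enabled).
AVERAGES = {
    ("Data Processing Terminal", False): ((119.12, 117.66, 105.76), (109.96, 111.43, 107.89)),
    ("Data Processing Terminal", True): ((98.08, 69.01, 50.68), (106.46, 82.51, 61.45)),
    ("The Nexus", False): ((85.31, 81.93, 71.20), (80.65, 78.98, 67.11)),
    ("The Nexus", True): ((64.13, 44.02, 32.46), (70.18, 56.34, 42.15)),
}


def crossover(gameplay, timedemo):
    """Return the first resolution where the timedemo catches up with gameplay.

    Returns None when gameplay stays ahead at every tested resolution, which is
    read here as the level still being system limited.
    """
    for resolution, game_fps, demo_fps in zip(RESOLUTIONS, gameplay, timedemo):
        if demo_fps >= game_fps:
            return resolution
    return None


if __name__ == "__main__":
    for (level, aa_af), (gameplay, timedemo) in AVERAGES.items():
        label = "with AA/AF" if aa_af else "without AA/AF"
        print(f"{level} ({label}): {crossover(gameplay, timedemo)}")
```

A result equal to the lowest tested resolution only tells us that the crossover lies at or below 1024 x 768, since nothing lower was tested.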

On the subject of representing the entire game (with a gameplay testing session), we think The Nexus is more representative than Data Processing Terminal. Frame rates are generally in line with what we saw through the entire game, particularly in levels that are graphically intensive, which makes it sit nicely as a worst case scenario. The performance hit is slightly different from timedemo runs, but close enough that any difference can be considered normal variation between runs. There are levels that have lower frame rates than The Nexus, but these levels are quite possibly system limited with our setup, so a faster processor than our Athlon 3500+ will be more useful in this case. For graphics cards, as with F.E.A.R, we recommend a slightly higher clocked GeForce 7900GT / 7800GTX or Radeon X1900XTX / X1900XT and equivalent to play at our test settings at 1024 x 768, 4x AA and 16x AF. Higher resolutions are only recommended with faster graphics card(s).

Serious Sam II, 1024 x 768

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Greendale, Replay | 46 | 59.274 | 83 | 44 | 57.408 | 76 |
| Jump in the Fire, Replay | 54 | 90.416 | 151 | 52 | 88.073 | 145 |
| Jump in the Fire, Gameplay | 62 | 96.025 | 181 | 58 | 96.639 | 175 |

Serious Sam II, 1280 x 960

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Greendale, Replay | 45 | 58.941 | 82 | 44 | 49.12 | 59 |
| Jump in the Fire, Replay | 53 | 89.593 | 146 | 53 | 87.927 | 144 |
| Jump in the Fire, Gameplay | 57 | 95.215 | 194 | 58 | 94.239 | 181 |

Serious Sam II, 1600 x 1200

| Test | Min | Avg | Max | Min (4x AA 16x AF) | Avg (4x AA 16x AF) | Max (4x AA 16x AF) |
| --- | --- | --- | --- | --- | --- | --- |
| Greendale, Replay | 44 | 57.419 | 74 | 31 | 34.734 | 43 |
| Jump in the Fire, Replay | 53 | 88.969 | 143 | 46 | 77.786 | 123 |
| Jump in the Fire, Gameplay | 57 | 94.731 | 174 | 44 | 76.608 | 123 |


| Test | 1024 x 768 | 1280 x 960 | 1600 x 1200 | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Jump in the Fire Gameplay | 96.03 | 95.22 | 94.73 | 0.84% | 1.35% |
| Jump in the Fire Replay | 90.42 | 89.59 | 88.97 | 0.91% | 1.60% |
| Greendale Replay | 59.27 | 58.94 | 57.42 | 0.56% | 3.13% |
| Average (entire game) | 96.87 | 94.66 | 88.42 | 1.14% | 7.33% |
| Average (top 5 graphical levels) | 95.21 | 89.99 | 77.57 | 4.15% | 18.38% |

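| Test | 1024 x 768 AA AF | 1280 x 960 AA AF | 1600 x 1200 AA AF | Performance Hit 1280 | Performance Hit 1600 |
| --- | --- | --- | --- | --- | --- |
| Jump in the Fire Gameplay | 96.64 | 94.24 | 76.61 | 2.48% | 20.73% |
| Jump in the Fire Replay | 88.07 | 87.93 | 77.79 | 0.17% | 11.68% |
| Greendale Replay | 57.41 | 49.12 | 34.73 | 14.44% | 39.50% |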

all numbers are in frame rates (except for percentages)

It may be hard to see from the graphs, but the results in the table show us that this game is system limited without AA and AF. Even with AA and AF, we only saw a change in average fps once we hit 1600 x 1200 on the Jump in the Fire level. As benchmarks go, the developer supplied Greendale replay is actually more informative, simply because it scales better. So we really didn't learn anything from this game besides the fact that replays produce lower frame rates than what you will see in actual gameplay.
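
One way to make 'system limited' concrete is to compare each performance hit against the variation between runs. The sketch below is an assumption for illustration, not a rule stated in the test procedure: it treats any hit no larger than 3.61 percent (the upper end of the run-to-run variation quoted earlier for these levels) as noise. The names `NOISE_PCT`, `HITS` and `scales` are illustrative.

```python
NOISE_PCT = 3.61  # upper end of the run-to-run variation quoted earlier

# Performance hit in percent at (1280 x 960, 1600 x 1200), from the tables.
HITS = {
    "Jump in the Fire Gameplay": (0.84, 1.35),
    "Jump in the Fire Gameplay, AA AF": (2.48, 20.73),
    "Greendale Replay": (0.56, 3.13),
    "Greendale Replay, AA AF": (14.44, 39.50),
}


def scales(hit_pct, noise_pct=NOISE_PCT):
    """True when the hit exceeds run-to-run noise, i.e. resolution matters."""
    return hit_pct > noise_pct


if __name__ == "__main__":
    for test, (hit_1280, hit_1600) in HITS.items():
        print(f"{test}: 1280 scales={scales(hit_1280)}, 1600 scales={scales(hit_1600)}")
```

Under this threshold, Jump in the Fire only starts to scale at 1600 x 1200 with AA and AF, while the Greendale replay with AA and AF already scales at 1280 x 960.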

Conclusion:

So are timedemos and replays useless? After all, they are not able to provide results that are close to frame rates in actual gameplay. We do not think so. Unlike gameplay testing sessions, timedemo runs and replays are very repeatable and are very good at pointing out performance influencing factors and bottlenecks (graphics card, processor, memory, disk writes / reads). Quake 4 really illustrates this point - most of the time you play this game, you're bottlenecked by the processor and not the graphics card on a setup similar to ours.

Also remember that gameplay testing sessions will likely have a higher variation between runs than a timedemo, simply because each run will be different in some way. Though you can minimize the effect this has on average frame rates, it would be very hard to compare frame rates second by second. Since there's always the chance of higher normal variation between runs with gameplay testing sessions, we can draw the wrong conclusions from this method of performance testing. Our experience with Serious Sam II is an example of this. The differences between the results at 1024 x 768 and 1280 x 960 that we got from our initial runs were actually normal variations between runs, and it was only after we performed more rigorous testing that we saw the entire game is very likely system limited on our setup.

Running a benchmark with a timedemo / replay is also quicker and way easier than doing a gameplay testing session. By carefully doing research and picking a level that mostly stresses one or two components, we can get a good estimate of the average fps that's pretty close to what we would get with a real gameplay testing session. The added bonus is that we also get a pretty good idea of which level represents the frame rates we will see throughout the entire game.

Our experience with gameplay testing sessions and timedemo runs / replays teaches us that no performance test tool is perfect. Each has its own advantages and disadvantages. Both should be used to try and 'capture' what kind of performance a piece of hardware can offer for a particular game. However, even using both doesn't mean we will be able to capture actual gameplay frame rates in their entirety. Remember, we made a subjective decision to play the game in a particular way that would be graphically intensive and produce repeatable results. We doubt real life gaming sessions from different gamers with different styles of play will be the same as ours. However, we're pretty sure that the results portray a worst case scenario - all things being equal, you should see slightly higher frame rates in these games with a similar setup.

Another thing to note is that these results are taken from the singleplayer levels of these games. Multiplayer levels are usually 'lighter' in graphical intensity, particularly since developers have no real control (no scripted events) over what happens in each room / corridor of a multiplayer level and so stick to a polygon / texture / shader budget. There are also mods to consider - some mods, like the Excessive mod for Quake fans, will likely put more strain on graphics cards than the default mode of play.

Then there is the trend of more and more games becoming 'open ended' and using free roam environments. With these games (The Elder Scrolls: Oblivion, Need for Speed: Most Wanted, Grand Theft Auto etc), it will be difficult if not impossible to represent gameplay with gameplay testing sessions, or even to cover every type of environment that's available in the game. In these cases, reviewers usually aim for a worst case scenario and conduct gameplay testing sessions in a scene / environment that's more graphically intensive than others. Of course, that brings us back to the first problem - results from these types of gameplay testing sessions will not reflect the frame rates you will encounter in the rest of the game.
