James Antill - Lies, damn lies, and benchmarks
Jul. 31st, 2009
02:47 pm - Lies, damn lies, and benchmarks
Or why I hate "quick benchmarks"...
Recently I've started to see a lot more of what I'd call "quick benchmarks", often it's to do with yum but mainly I think that's because those tend to get sent to me by multiple methods. So I thought I'd try and write down why I react so negatively/dismissively to them, how people can spot the underlying problems that annoy me and even better some advise on how you can go about doing some real benchmarks if that kind of thing interests you (but it's much more work than quick benchmarks).
The summary of the problem is that quick software benchmarking often involves taking a huge amount of differences between two applications and have a single number result. Then you compare just the numbers, and come to a conclusion. So X gets 3 and Y gets 5 for problem ABCD ... therefore Y is 66% worse than X at ABCD. Except that might be a highly misleading (or worse) conclusion, for a number of reasons:
This is often the route cause of all problems in quick benchmarks, and because of it's pervasiveness in human nature I would say that a significant portion of the work in doing a good benchmark is making sure you've tried to reduce the effect of your own bias. The bias happens in many forms, from the simple fact that when testing X and Y you might have a deeper understanding of one and so test it's strengths; make less configuration errors; or even assume favourable results are correct but unfavourable ones are incorrect.
An old example
Probably the first experience I had of this was at the end of the 1990s, I can find a link to the original "discussions" now but I can relate the main points. Linux had supported SMP (running on 2 or more CPUs at once) for a couple of years and FreeBSD was just starting to look at it seriously. While doing the Linux work the kernel developers had created a simple benchmark of rebuilding the kernel with different "make -j" configurations (controlling how parallel make), probably because it's simple tests a bunch of things at once and affects kernel developers personally. Obviously it was natural for the FreeBSD developers to do the same kind of thing, when they were doing the same kinds of work. I found what I think is the original FreeBSD report that someone posted on this FreeBSD SMP - kernel compilation bench.
Note that both of the above benchmarks were meant to be tested against themselves (ie. the developer would run the benchmark, make some changes and then run the benchmark again on the new kernel with the changes and see the difference). And, again, they were great for that ... pretty much all developers have similar kinds of test runs that they run. However someone then decided to take "the result for Linux" and compare it to "the result for FreeBSD", and came to the conclusion that because the FreeBSD number was smaller than the Linux number then that meant "FreeBSD was already faster at SMP than Linux". This conclusion was repeated by a number of knowledgable FreeBSD developers before a Linux developer pointed out the obvious fact that the tests didn't just have differenet kernels, but different source trees were being compiled with different compilation toolchains.
Again, I don't think the FreeBSD developers intentionally meant to create or use bogus data just that given some random numbers implying that FreeBSD was better than Linux they were biased to believe them and so didn't think about it too much.
A couple of yum examples
I've seen a lot of people test "yum makecache" vs. "apt-get update" or "smart update" or "zypper refresh". This makes some sense from the point of view of an apt developer/user because this is an operation that the user has to run before any set of operations to make sure the database is upto date, so any improvement (relative to older versions of your self) is a significant win. However a yum user is unlikely to ever run this command because the default mode of operation is for yum to manage database synchronization.
Another common problem is to compare something like "apt-cache search" / "apt-file search" against "yum search" / "yum provides", the problem here is that apt-cache / apt-files just do a simple grep on the cache of available packages ... yum does it's searches against what yum can see use (so, for example, versionlock'ing a package will affect the results you get in yum ... as will installing packages).
These problems should be obvious after even a moments thought about how yum is used, but again most of the people publishing this kind of results either don't use yum or are expecting it to be slower and so don't think about results which confirm that expectation.
Measuring the slow thing
Many developers are aware that their applications have a "fast path" and a "slow path", which normally align with "unexpected cases work, but are slow" and "expected, normal, cases work quickly". Benchmarkers often do not know these distinctions. I've seen numerous cases where someone benchmarks something that wouldn't be done in real life. Alas. this is hard to spot for the normal user, and is one of the reasons that it's wise to speak with the developers of whatever you are benchmarking.
Benchmarking 1 point and then concluding about N
The obvious yum example
By far the most common problem here is people run a simple benchmark like "time (echo n | yum update)" compared to the same operation in apt or zypp and then concluding X is x% faster at updating than Y. The problem here is that updating actually has a number of operations (breaking them down roughly):
- read repo. configuration, cmd line options etc.
- check if repo. metadata is current.
- download repo. metadata if not current.
- merge configured repo. metadata into "available".
- read current rpm metadata.
- depsolve updates+obsoletes+etc. from rpm and repo. metadata.
- output changes.
- confirm changes.
- download updates.
- install updates.
However even though there are 10 operations above, the "simple update benchmark" only really tests "6. depsolve ...". Which is fine on it's own, it's worth knowing what the depsolve time is (and I run this benchmark myself with yum to find out that info.) but depsolve time is not the same as update time. In fact most of the time it's in the noise for the update operation as a whole (obviously saying your update takes 25% of the time is much more fun than saying it takes 98% of the time).
I've even seen problems here where they'll test apt on Debian vs. yum on Fedora, which much like the FreeBSD vs. Linux test changes so many different things that you can't conclude anything about just the package managers. There have also been cases where someone tested "apt upgrade" vs. "yum upgrade" but the yum operation also download repo. metadata, as it wasn't current/available.
Benchmarking N points and then concluding about 1
The FreeBSD vs. Linux point
As I said above the technical problem with the FreeBSD make kernel vs. Linux make kernel problem was that it was testing 3 different sets of things (buildtools, source code, kernel) and drawing a conclusion about only 1 of them.
Phoronix has many examples here
Phoronix has many examples of all the different problems you can get into here. From things like their "apache benchmark", which turned out to have nothing to do with Apache on either platform they tested it on, and mp3 encoding "filesystem benchmarks". To their usual "conclusions" which take 10-20 completely separate points, with completely separate error rates, and come up with an "average".
The obvious yum example
The usual way this happens in package management "benchmarking" is that a random number of commands will be tested, like install; update; remove; whatprovides; search; list ... and then a final score will be given. This completely ignores the fact some of those commands are being run much more often than others, and in different situations.
Standard deviation is always significant
The simple way to think of this is that multiple runs of anything you test need to be performed, and if you don't have results where the answer is "X and Y perform the same" (even though the absolute numbers are never going to be identical) ... the benchmarking has probably screwed up the standard deviation.
Phoronix has many examples here
This is often combined with the previous problem, if you test X vs. Y and get the results of 1000 and 1200 then there's a huge difference between a SD (standard deviation) of 500 and an SD of 1 on those results. And this is esp. important when you have another result which is 10 vs. 20 and an SD that could be 0.1 or 5.
The obligatory yum point
I never see benchmarks of package management operations that include a standard deviation value, and this can be esp. important when different values and/or different configurations can affect the performance significantly.
What to do, if you want to create a real benchmark
All of the above is not to try to discourage real benchmarking, as this is a time consuming and unrewarding task (IMO), but often bad benchmarking is worse than none at all. If you aren't up for that then feel free to suggest usecases that seem difficult/slow/etc., even if you need to explain it as "currently I run X like this, and it takes N time, how do I do something similar with Y in a similar amount of time" ... this is not the same thing, because the answer might well be "run Y like this instead". But if you are up for the challenge then...
As I said earlier, if you are benchmarking X and Y and comparing them then you need to be very familiar with both X and Y, which means investing a significant amount of time working with both X and Y. This very likely means that you want to speak to developers of both X and Y, both about the measurements you are trying to take and the results you are getting. I am always very suspicious if the benchmarking seems to be one sided (in that more knowledge was available about X and Y, at the time of testing), or more to the point I'm very suspicious if you are benchmarking yum but I've no idea who you are.
Also, even if you create a good benchmark which shows X is 999x faster than Y for operation FOO, and FOO is a useful feature that people will want to take advantage of. It is likely that the developers for Y will be able to change Y (sometimes trivially, as it might be an assumed edge case or regression noone had hit, yet) which reduces the difference significantly. This also often means that doing benchmarks properly "just" makes both X and Y better, with the published results showing they are within 5% or whatever and so "nobody cares" about the benchmarks. Which makes running the benchmarks pretty unrewarding, but hey you want to be the benchmarker not me... :)