
Lies, damn lies, and benchmarks - James Antill — LiveJournal

Jul. 31st, 2009, 02:47 pm


Or why I hate "quick benchmarks"...

Recently I've started to see a lot more of what I'd call "quick benchmarks". Often they're to do with yum, but mainly I think that's because those are the ones that get sent to me, by multiple methods. So I thought I'd try to write down why I react so negatively/dismissively to them, how people can spot the underlying problems that annoy me, and, even better, some advice on how you can go about doing some real benchmarks, if that kind of thing interests you (but it's much more work than quick benchmarks).

The summary of the problem is that quick software benchmarking often involves taking a huge number of differences between two applications and reducing them to a single number each. Then you compare just the numbers, and come to a conclusion. So X gets 3 and Y gets 5 for problem ABCD ... therefore Y is 66% worse than X at ABCD. Except that might be a highly misleading (or worse) conclusion, for a number of reasons:
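To make the single-number trap concrete, here's a toy sketch (the 3 and 5 are the made-up scores from the paragraph above, nothing measured): the same pair of numbers supports several different-sounding headlines depending on which ratio you quote.

```python
# Toy numbers from the text: X scores 3, Y scores 5 (lower is better).
x, y = 3, 5

# The same pair of numbers supports several different headlines:
print((y - x) / x * 100)  # "Y is ~66% slower than X"
print((y - x) / y * 100)  # "X is 40% faster than Y"
print(y / x)              # "Y takes ~1.67x as long as X"
```

None of these is wrong arithmetically, which is exactly why a bare "66% worse" headline tells you so little about what was actually measured.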

Benchmarking Bias

This is often the root cause of all problems in quick benchmarks, and because of its pervasiveness in human nature I would say that a significant portion of the work in doing a good benchmark is making sure you've tried to reduce the effect of your own bias. The bias happens in many forms, from the simple fact that when testing X and Y you might have a deeper understanding of one, and so test its strengths; make fewer configuration errors; or even assume favourable results are correct but unfavourable ones are incorrect.

An old example

Probably the first experience I had of this was at the end of the 1990s. I can't find a link to the original "discussions" now, but I can relate the main points. Linux had supported SMP (running on 2 or more CPUs at once) for a couple of years, and FreeBSD was just starting to look at it seriously. While doing the Linux work the kernel developers had created a simple benchmark of rebuilding the kernel with different "make -j" configurations (controlling how parallel make runs), probably because it's simple, tests a bunch of things at once, and affects kernel developers personally. Obviously it was natural for the FreeBSD developers to do the same kind of thing when they were doing the same kinds of work. I found what I think is the original FreeBSD report that someone posted on this: FreeBSD SMP - kernel compilation bench.

Note that both of the above benchmarks were meant to be tested against themselves (ie. the developer would run the benchmark, make some changes, then run the benchmark again on the new kernel with the changes and see the difference). And, again, they were great for that ... pretty much all developers have similar kinds of test runs. However, someone then decided to take "the result for Linux" and compare it to "the result for FreeBSD", and came to the conclusion that because the FreeBSD number was smaller than the Linux number, "FreeBSD was already faster at SMP than Linux". This conclusion was repeated by a number of knowledgeable FreeBSD developers before a Linux developer pointed out the obvious fact that the tests didn't just have different kernels: different source trees were being compiled with different compilation toolchains.

Again, I don't think the FreeBSD developers intentionally meant to create or use bogus data; just that, given some random numbers implying FreeBSD was better than Linux, they were biased to believe them and so didn't think about it too much.

A couple of yum examples

I've seen a lot of people test "yum makecache" vs. "apt-get update" or "smart update" or "zypper refresh". This makes some sense from the point of view of an apt developer/user, because this is an operation that the user has to run before any set of operations to make sure the database is up to date, so any improvement (relative to older versions of yourself) is a significant win. However, a yum user is unlikely to ever run this command, because the default mode of operation is for yum to manage database synchronization.

Another common problem is to compare something like "apt-cache search" / "apt-file search" against "yum search" / "yum provides". The problem here is that apt-cache / apt-file just do a simple grep on the cache of available packages, while yum does its searches against what yum can actually see and use (so, for example, versionlock'ing a package will affect the results you get in yum ... as will installing packages).

These problems should be obvious after even a moment's thought about how yum is used, but again most of the people publishing these kinds of results either don't use yum or are expecting it to be slower, and so don't think twice about results which confirm that expectation.

Measuring the slow thing

Many developers are aware that their applications have a "fast path" and a "slow path", which normally align with "unexpected cases work, but are slow" and "expected, normal, cases work quickly". Benchmarkers often do not know these distinctions, and I've seen numerous cases where someone benchmarks something that wouldn't be done in real life. Alas, this is hard to spot for the normal user, and is one of the reasons it's wise to speak with the developers of whatever you are benchmarking.

Benchmarking 1 point and then concluding about N

The obvious yum example

By far the most common problem here is that people run a simple benchmark like "time (echo n | yum update)", compare it to the same operation in apt or zypp, and then conclude that X is N% faster at updating than Y. The problem here is that updating actually consists of a number of operations (breaking them down roughly):

  1. read repo. configuration, cmd line options etc.
  2. check if repo. metadata is current.
  3. download repo. metadata if not current.
  4. merge configured repo. metadata into "available".
  5. read current rpm metadata.
  6. depsolve updates+obsoletes+etc. from rpm and repo. metadata.
  7. output changes.
  8. confirm changes.
  9. download updates.
  10. install updates.

However, even though there are 10 operations above, the "simple update benchmark" only really tests "6. depsolve ...". Which is fine on its own; it's worth knowing what the depsolve time is (and I run this benchmark myself with yum to find out that info.), but depsolve time is not the same as update time. In fact most of the time it's in the noise for the update operation as a whole (obviously saying your update takes 25% of the time is much more fun than saying it takes 98% of the time).

I've even seen problems here where people test apt on Debian vs. yum on Fedora, which, much like the FreeBSD vs. Linux test, changes so many different things that you can't conclude anything about just the package managers. There have also been cases where someone tested "apt upgrade" vs. "yum upgrade" but the yum operation also downloaded repo. metadata, as it wasn't current/available.

Benchmarking N points and then concluding about 1

The FreeBSD vs. Linux point

As I said above, the technical problem with the FreeBSD make kernel vs. Linux make kernel comparison was that it was testing 3 different sets of things (buildtools, source code, kernel) and drawing a conclusion about only 1 of them.

Phoronix has many examples here

Phoronix has many examples of all the different problems you can get into here: from things like their "apache benchmark", which turned out to have nothing to do with Apache on either platform they tested it on, and mp3 encoding "filesystem benchmarks", to their usual "conclusions" which take 10-20 completely separate data points, with completely separate error rates, and come up with an "average".

The obvious yum example

The usual way this happens in package management "benchmarking" is that a random number of commands will be tested, like install; update; remove; whatprovides; search; list ... and then a final score will be given. This completely ignores the fact that some of those commands are run much more often than others, and in different situations.
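A small sketch of why an unweighted "final score" misleads. All the numbers here are invented for illustration (hypothetical per-command timings and usage frequencies, not real measurements of any package manager): a naive average can rank two tools one way, while weighting by how often a user actually runs each command ranks them the other way.

```python
# Hypothetical per-command timings in seconds (invented numbers).
timings_x = {"install": 10, "update": 30, "search": 2, "whatprovides": 4}
timings_y = {"install": 12, "update": 20, "search": 8, "whatprovides": 1}

# Hypothetical runs per week for one user (also invented).
weekly_runs = {"install": 2, "update": 7, "search": 20, "whatprovides": 1}

def naive_score(timings):
    # Treats every command as equally important.
    return sum(timings.values()) / len(timings)

def weighted_score(timings):
    # Total time the user actually spends per week.
    return sum(timings[cmd] * weekly_runs[cmd] for cmd in timings)

for name, timings in [("X", timings_x), ("Y", timings_y)]:
    print(name, naive_score(timings), weighted_score(timings))
```

With these made-up numbers the naive average favours Y, but the usage-weighted total favours X; which ranking is "right" depends entirely on workload assumptions the quick benchmark never states.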

Standard deviation is always significant

The simple way to think of this is that you need to perform multiple runs of anything you test, and if you never get results where the answer is "X and Y perform the same" (even though the absolute numbers are never going to be identical) ... the benchmarking has probably not accounted for the standard deviation properly.
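A minimal sketch of what "multiple runs" looks like in practice. This is a generic wall-clock harness, not anything yum-specific; a real benchmark would also control for cold vs. warm caches, outliers, background load, and so on.

```python
import statistics
import subprocess
import sys
import time

def time_command(argv, runs=5):
    """Run a command several times; return (mean, stdev) wall-clock seconds.

    A bare-bones sketch: real benchmarking also needs to control for
    cold vs. warm caches, outliers, CPU frequency scaling, etc.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Example with a trivial command; substitute whatever is being measured.
mean, sd = time_command([sys.executable, "-c", "pass"])
print(f"{mean:.3f}s +/- {sd:.3f}s")
```

Reporting the mean without the spread is exactly the mistake this section describes: two tools whose sample distributions overlap heavily "perform the same", whatever their means say.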

Phoronix has many examples here

This is often combined with the previous problem: if you test X vs. Y and get results of 1000 and 1200, then there's a huge difference between an SD (standard deviation) of 500 and an SD of 1 on those results. And this is esp. important when you have another result which is 10 vs. 20 and an SD that could be 0.1 or 5.
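The 1000-vs-1200 point can be sketched with a crude overlap check (a rough rule of thumb, not a proper significance test; the threshold of twice the combined spread is my own arbitrary choice for illustration):

```python
def clearly_different(a, b, sd_a, sd_b):
    """Crude check: is the gap between means bigger than the combined spread?

    A rule of thumb for illustration only, not a real significance test.
    """
    return abs(a - b) > 2 * (sd_a + sd_b)

print(clearly_different(1000, 1200, 500, 500))  # False: gap lost in noise
print(clearly_different(1000, 1200, 1, 1))      # True
print(clearly_different(10, 20, 5, 5))          # False
print(clearly_different(10, 20, 0.1, 0.1))      # True
```

Same pairs of means, opposite conclusions, purely depending on the standard deviation the quick benchmark never reported.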

The obligatory yum point

I never see benchmarks of package management operations that include a standard deviation value, and this can be esp. important when different values and/or different configurations can affect the performance significantly.

What to do, if you want to create a real benchmark

All of the above is not meant to discourage real benchmarking, though it is a time consuming and unrewarding task (IMO); but bad benchmarking is often worse than none at all. If you aren't up for it then feel free to suggest usecases that seem difficult/slow/etc., even if you have to phrase it as "currently I run X like this, and it takes N time; how do I do something similar with Y in a similar amount of time?" ... this is not the same thing, because the answer might well be "run Y like this instead". But if you are up for the challenge then...

As I said earlier, if you are benchmarking X and Y and comparing them then you need to be very familiar with both X and Y, which means investing a significant amount of time working with both. This very likely means that you want to speak to the developers of both X and Y, both about the measurements you are trying to take and the results you are getting. I am always very suspicious if the benchmarking seems to be one sided (in that more knowledge was available about X than Y, at the time of testing), or, more to the point, I'm very suspicious if you are benchmarking yum but I've no idea who you are.

Also, even if you create a good benchmark which shows X is 999x faster than Y for operation FOO, and FOO is a useful feature that people will want to take advantage of, it is likely that the developers of Y will be able to change Y (sometimes trivially, as it might be an assumed edge case or a regression no one had hit yet) in a way which reduces the difference significantly. This also often means that doing benchmarks properly "just" makes both X and Y better, with the published results showing they are within 5% or whatever, and so "nobody cares" about the benchmarks. Which makes running the benchmarks pretty unrewarding, but hey, you want to be the benchmarker, not me... :)


Date:July 31st, 2009 08:59 pm (UTC)
For yum in particular... can you write up a howto on isolating each of the 10 steps, for people doing valid bottleneck troubleshooting?

If I do see a significant slowdown after enabling an optional plugin or feature (like repository scoring, for example), having instructions on how to isolate which of the ten steps is seeing a bottleneck would be good.

In fact...you may even want to make sure you include the disableplugin step for completeness in the isolation instructions to help people self-determine if the problem is in a plugin codebase or in the base yum code.

Date:August 2nd, 2009 04:36 am (UTC)

Speed bugs are just bugs...


I'd advise finding speed bugs the same way I'd advise trying to find other bugs ... the most obvious step is to contact a yum developer (and def. not post to the closest mailing list :).

For instance when we recently pushed a regression which made "cost" go from seconds to minutes, a bunch of people spotted it (one even saying it worked fine if he commented out a particular line of code) and we eventually found the underlying problem (but it took me hours to work out what the problem was).

But I'm pretty sure a normal user wouldn't have been helped by detailed instructions on how to isolate parts of commands.

When I listed the "10 things the update command does" I didn't mean I thought it would be useful to time each of them, quite the opposite ... I meant that it's pretty suspicious to compare only part of the operation, because a normal user will often end up doing all of it. Specifically, if the rpmdb and repos. aren't in core, doing the IO to read that data can easily dwarf the depsolve time, and it's very rare for a user to not confirm the transaction (and again, the IO there will often be the longest part of the operation, by far). So even a 75% difference in depsolving can easily be only 1% of the update operation.

Date:August 1st, 2009 11:04 pm (UTC)
Well one point you missed here is this:

app X does foo slower than app Y does foo.

For the user this means that X is worse than Y; the details of the implementation that cause this are irrelevant here (Y uses a cache while X does not, or X performs these additional steps).

Such benchmarks are not really helpful in finding and fixing issues, but the fact that X is slower when doing foo _is_ a problem (whether it's inefficient code or a bad/worse design decision does not matter to the user).

Take this as an example: app A starts 20% faster than app B, so people will complain that B is worse here. B might just be doing more tasks on startup, but the user does not care about that; the end result is that B _is_ slower at startup.

(well ok startup is not really a "task" but you should get the point).
Date:August 2nd, 2009 04:50 am (UTC)

I agree with you

app X does foo slower than app Y does foo. For the user this means that X is worse than Y, the details of the implementation that cause this are irrelevant here (Y uses a cache while X does not, or X performs these additional steps).

If I said that, I didn't mean to; by all means complain if "doing operation FOO" is significantly different. When I spoke about "yum update", the point was that taking a small subset of the overall operation, measuring that, and then complaining about the full operation based on that small difference is misleading at best.

For example, say a full operation of "yum update" takes 26 seconds, and using the same data (and getting the same result) "foo update" takes 22 seconds ... now let's say that the depsolving part of each takes 6 seconds and 2 seconds respectively. It's fair to say that the difference in time is likely due to the depsolver, and that yum is slower, but those 4 seconds matter significantly less when you include the 20 other seconds needed for the operation.
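Spelling out that arithmetic (a tiny sketch using the illustrative 26/22 and 6/2 second figures from the example above):

```python
# Illustrative numbers from the example: full "update" operation totals,
# and the depsolve portion of each.
yum_total, foo_total = 26, 22
yum_depsolve, foo_depsolve = 6, 2

print(yum_depsolve / foo_depsolve)  # depsolve alone: 3.0x slower
print(yum_total / foo_total)        # whole operation: ~1.18x slower
```

A "3x slower at depsolving" headline and an "18% slower update" headline describe the same run; only the second one reflects what the user actually waits for.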

Related to that, if you use something with manual synchronization, like apt, but don't include the "apt-get update" part, then that is "unfair" and shouldn't be passed off as merely an implementation detail. Because real users need to synchronize before doing an operation.
