Multiple Comparisons and Poker Results

by Daniel Kimberg | Published: Nov 08, 2002

If you want an estimate of your hourly earning rate, the most readily available estimate, and probably the most useful, is just your total winnings (or losses) divided by your total number of hours played. So, if you've played 1,000 hours and you're up 1,000 big bets, you'd probably guess your earning rate is one big bet per hour. How accurate is that estimate? The standard error of the mean (or SEM) can be used to draw confidence intervals around your estimate. Two standard errors (2SEM) gives about a 95 percent confidence interval, meaning that for practical purposes, you can assume your true earning rate lies within 2SEM of your average result. About 2.3 percent of the time, your true rate will be outside that interval to the high side, and the same to the low side. Your SEM goes down as you play more hours, so you get a more accurate estimate of your earning rate with more data.
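
If you'd like to plug in your own numbers, here's a minimal Python sketch of this calculation. The function name is mine, and the default 10 big bet hourly standard deviation (the same assumption used below) is only a placeholder for your own figure:

```python
import math

def earning_rate_ci(total_bb, hours, hourly_sd=10.0, z=2.0):
    """Observed hourly rate with a ~95 percent confidence interval.

    total_bb  -- total big bets won (or lost, negative) over the sample
    hours     -- total hours played
    hourly_sd -- assumed hourly standard deviation, in big bets
    z         -- number of SEMs around the estimate (2 is ~95 percent)
    """
    rate = total_bb / hours              # winnings divided by hours played
    sem = hourly_sd / math.sqrt(hours)   # standard error of the mean
    return rate, rate - z * sem, rate + z * sem

# The example from the text: up 1,000 big bets after 1,000 hours.
rate, low, high = earning_rate_ci(1000, 1000)
print(f"{rate:.2f} BB/hr, 95% CI ({low:.2f}, {high:.2f})")
# -> 1.00 BB/hr, 95% CI (0.37, 1.63)
```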

If we make some fairly mundane assumptions, such as a 10 big bet hourly standard deviation, we can use the following table to get a sense of how accurate your earning rate estimate will be after a given number of hours. The first column shows the number of hours, and the second shows the SEM in big bets. The remaining columns, for convenience, show some other multiples of the SEM, some of which are useful only because they come up later in this column. 2SEM is a reasonable standard. So, after 400 hours, your earning rate estimate may still be off by as much as a big bet in either direction.

Hours   SEM    1.64SEM   2SEM   3.3SEM   3.9SEM
100     1.00   1.64      2.00   3.30     3.90
400     0.50   0.82      1.00   1.65     1.95
900     0.33   0.55      0.67   1.10     1.30

(All values are in big bets per hour, computed as multiples of SEM = 10/√hours.)
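
None of these numbers is mysterious: every entry is just a multiple of SEM = 10/√hours. A few lines of Python (variable names mine) reproduce the table:

```python
import math

hourly_sd = 10.0  # the assumed hourly standard deviation, in big bets
for hours in (100, 400, 900):
    sem = hourly_sd / math.sqrt(hours)
    row = "   ".join(f"{m}SEM = {m * sem:.2f}" for m in (1, 1.64, 2, 3.3, 3.9))
    print(f"{hours:>4} hours:  {row}")
```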

Even if you're winning more than a full bet an hour after 400 hours, you have to bear in mind that some proportion of break-even players (about 2.3 percent) will be up by at least that much. After 100 hours, a series of 20 five-hour sessions, your results are so tenuously connected to your true earning rate that you would be foolish to conclude that you're a winner or a loser without some fairly extreme results.

This brings me to the topic of this column: what to believe when someone tells you how well he's been doing. Suppose that instead of charting your own results for 100 hours, you read a post on the Internet newsgroup rec.gambling.poker (RGP) from a player who claims he's been beating up on your regular game for 100 hours, to the tune of six big bets an hour (with an hourly standard deviation of 10 big bets). Ignoring the distinct but irrelevant (for present purposes) possibility that the guy is lying, your first instinct might be that he just got lucky. Your second instinct might be to look at the above table and see that even after only 100 hours, a 6BB per hour earning rate is well outside the 95 percent confidence interval around zero. In other words, you should believe the guy is a winner.

Your third instinct, however, is the one you should go with. It's the one that asks the following question: How many other people are out there just like this guy, but who didn't post? This turns out to be almost as important as the results themselves in evaluating this kind of statistical evidence. The reason is what's known as the multiple comparison problem.

The multiple comparison problem crops up whenever you perform a slew of statistical tests all at once. It's true that the chances of a single break-even player winning 2BB per hour or better after 100 hours are small (2.3 percent). But what if you're looking at 10 players, or 100, or 1,000? The chances of seeing at least one 2BB winner in the group go up rapidly, to 21 percent, 90 percent, and nearly 100 percent, respectively. The expected number of 2SEM winners among 1,000 break-even players would be about 23. If you used the same criterion that you would use for a single player, you'd have to give all 23 of them credit for beating the game, even though you know they're just break-even players. So, clearly, we need a more stringent criterion.
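
Those percentages follow from elementary probability: the chance that none of n break-even players clears the cutoff is (1 - .023)^n. A short sketch, for anyone who wants to check the arithmetic:

```python
p = 0.023  # one-sided tail probability beyond 2SEM
for n in (1, 10, 100, 1000):
    at_least_one = 1 - (1 - p) ** n  # P(at least one lucky "winner")
    print(f"n={n:>4}: {at_least_one:6.1%} chance of a 2SEM winner; "
          f"expected number {p * n:.1f}")
```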

Fortunately, statisticians have ways of adjusting for multiple comparisons. The most commonly used method bears the name of its originator, Carlo Emilio Bonferroni, and offers a way to control the probability of one or more false results among a group of comparisons. Suppose that you're willing to live with being wrong about 5 percent of the time, which corresponds to 1.64SEM. Bonferroni correction means taking that rate and dividing it by the total number of comparisons. So, if you're collecting results from 100 people, in order to feel confident that someone is beating the game, you want to take your usual criterion of 5 percent and divide by 100 to get your new criterion of 0.05 percent (5 in 10,000). Glossing over a bit of calculation, this gives us a cutoff of 3.3SEM (roughly twice what we'd need for just a single player). Using that cutoff, you can feel confident that no more than 5 percent of the time would you make so much as a single error in 100 tries.
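
The cutoffs themselves come from the inverse of the normal distribution applied to the corrected tail probability. Here's a sketch using SciPy's norm.ppf; the helper function is my own wrapper, not a standard routine:

```python
from scipy.stats import norm  # inverse normal CDF; a spreadsheet can do this too

def bonferroni_cutoff(alpha, comparisons):
    """One-sided z cutoff, in SEMs, after dividing alpha by the comparisons."""
    return norm.ppf(1 - alpha / comparisons)

print(f"{bonferroni_cutoff(0.05, 1):.2f}SEM")    # 1.64SEM, single-player criterion
print(f"{bonferroni_cutoff(0.05, 100):.2f}SEM")  # 3.29SEM, the 3.3SEM cutoff above
```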

There are a few difficulties with this technique, the most important of which is getting the number of comparisons right. It's easy if you send 10 players out to the cardroom, have them play for exactly 100 hours, and then have them report back. It's a little harder to figure out the right correction if you're getting a more or less continuous influx of people telling you their results at opportune times, all with different numbers of hours played and different earning rates. And it's a lot harder if you don't actually hear about all of the results, which is the case when you're evaluating claims posted to newsgroups, or claims made by people you run into in the cardroom. People don't tend to post permanently archived and worldwide-accessible messages to a public newsgroup if their results have been mediocre. The effective number of comparisons is really the total number of people who would have posted if only they'd done well enough, a number that's hard to estimate.

Fortunately, differences in the severity of the correction get smaller as the numbers get bigger. The table below has some representative values for a 5 percent criterion (in the old days, numbers like these came from tables at the back of statistics textbooks, but today you can use any spreadsheet program, and there are numerous calculator applets available on the web). For good measure, I've thrown in the numbers for a .023 criterion, which is widely used because it corresponds to a round number of standard errors, 2SEM.

Comparisons   5% criterion   .023 criterion
1             1.64SEM        2.00SEM
10            2.58SEM        2.83SEM
100           3.29SEM        3.50SEM
1,000         3.89SEM        4.08SEM

(Cutoffs are one-sided, from the inverse normal applied to the corrected tail probability.)
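
As noted above, any spreadsheet's inverse-normal function will produce these values; here's the equivalent in a few lines of Python:

```python
from scipy.stats import norm

print("comparisons   5% criterion   .023 criterion")
for n in (1, 10, 100, 1000):
    z05 = norm.ppf(1 - 0.05 / n)    # Bonferroni-corrected 5 percent cutoff
    z023 = norm.ppf(1 - 0.023 / n)  # same correction for the .023 criterion
    print(f"{n:>11,}   {z05:9.2f}SEM   {z023:11.2f}SEM")
```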

So, if you think there are 1,000 people out there who might have posted if they'd had good results, you can act as though you've made 1,000 statistical comparisons. Your criterion for believing someone is a winner should be an earning rate 3.9SEM better than break-even. Referring back to our first table, that means 3.9BB per hour after 100 hours, 1.3BB per hour after 900 hours, and so on. It's a fairly stringent criterion, but that's what happens when you perform 1,000 statistical comparisons. If you want to try out some numbers on your own, the calculations are fairly simple. (If I get a few e-mail requests, I'll put a spreadsheet and instructions for it up on my web site.) But even without the numbers, the useful lesson to take away from this is that if someone tells you they're beating the game, it's worth thinking carefully about how many other people would have told you the same thing had they also won. The larger the number, the more skeptical you should be.
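
For those who do want to try out some numbers before that spreadsheet materializes, the arithmetic is one line: the required earning rate is the SEM cutoff times the hourly standard deviation, divided by the square root of hours played. A Python sketch (function name mine, same 10 big bet assumption):

```python
import math

def required_rate(sem_cutoff, hours, hourly_sd=10.0):
    """Earning rate (in BB/hr) needed to clear a given SEM cutoff."""
    return sem_cutoff * hourly_sd / math.sqrt(hours)

print(required_rate(3.9, 100))  # 3.9 BB/hr after 100 hours
print(required_rate(3.9, 900))  # 1.3 BB/hr after 900 hours
```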

While Bonferroni correction is a mildly obscure statistical procedure, it does have some intuitive appeal. It's often said that if you give enough monkeys enough typewriters, one of them eventually will produce the complete works of Shakespeare (and another will throw in a few bonus plays for good measure, equally brilliant but with banana-related themes). How surprised we should be when we see the results really depends on the number of monkeys. If we gave one monkey a typewriter and he delivered, we would probably be justified in calling him a great writer. But if we had a large enough group of monkeys (as a practical matter, far more monkeys than you could fit in the universe), we wouldn't think that monkey was particularly smart at all, just the lucky winner among a near-infinite cohort of random typists. In poker terms, the problem with paying too much attention to posted results is that it's hard to tell how many monkeys are out there.

Oddly enough, although some skepticism is appropriate when looking at random results that you find posted on RGP or that you hear about in the cardroom, the players who post those results are probably reasonably justified in believing them. From their individual points of view, they've done a simple experiment with one player, not thousands. For each of them, there's only one player they would have posted about (themselves). So, when someone tells you they've been doing exceptionally well, they may have good reason to believe so, even if you don't. It seems almost paradoxical, and it can be a little awkward when you're talking to someone about his early success, but that's the way it goes.