\(\renewcommand{\epsilon}{\varepsilon}\) \(\renewcommand{\hat}{\widehat}\) \(\DeclareMathOperator*{\E}{\mathbb{E}}\) \(\renewcommand{\emptyset}{\varnothing}\)

Determining the direction of a coin's bias

The problem

Suppose you're playing a two-player game, like maybe ping pong or rock-paper-scissors. In each round, one player gets a point. It's possible to play any number of rounds, and nothing much changes from one round to the next. How long should you play if you want to figure out who is better at the game?

Let's model each round as a biased coin toss. One player gets a point with probability \(\frac{1}{2} + \epsilon\) and the other player gets a point with probability \(\frac{1}{2} - \epsilon\), independently each round, for some \(\epsilon > 0\). The goal is to determine which player is which.

When \(\epsilon\) (the magnitude of the bias) is known, this is a fundamental and well-studied topic in probability and statistics. By Hoeffding's inequality, if you toss the coin \(O(1/\epsilon^2)\) times, there's a 99% chance that the side toward which the coin is biased will come up more often.

But why would you and your opponent know in advance how closely matched your ping pong skills are? Not knowing \(\epsilon\) adds an extra layer of complexity to the problem.

The solution

The idea is, if a huge fraction of the first few tosses are heads/tails, then you can safely halt and declare that the coin is biased toward heads/tails respectively. But if the first few tosses are split relatively evenly between heads and tails, then you should go back to tossing the coin for a while before considering halting again. You repeat this process as long as it takes, tossing the coin many times, then considering halting. With each successive iteration, you can safely lower the threshold for halting. (A 60-40 split among a gazillion coin tosses is much more meaningful than a 60-40 split among 10 coin tosses.) No matter how small the true bias is, eventually you'll halt.

Here are the quantitative details. Let \(\delta > 0\) be the failure probability you're willing to tolerate. Define \(\epsilon_i = 2^{-i}\), \(\delta_i = \frac{6 \delta}{\pi^2 i^2}\), and \(n_i = \left\lceil \frac{\ln(1/\delta_i)}{2\epsilon_i^2} \right\rceil\). For each \(i\), after tossing the coin \(n_i\) times, let \(\hat{p}_i\) be the fraction of coin tosses that have come up heads so far. If \(\hat{p}_i \geq \frac{1}{2} + \epsilon_i\), halt and declare that the coin is biased toward heads. If \(\hat{p}_i \leq \frac{1}{2} - \epsilon_i\), halt and declare that the coin is biased toward tails. If \(\frac{1}{2} - \epsilon_i \lt \hat{p}_i \lt \frac{1}{2} + \epsilon_i\), keep tossing the coin.

Analysis: By Hoeffding's inequality, the probability that you halt and give the wrong answer after \(n_i\) tosses is at most \(\delta_i\). Therefore, by the union bound, the probability that you ever give the wrong answer is at most \(\sum_i \delta_i = \delta\). Meanwhile, once \(i\) is large enough that \(\epsilon_i \leq \epsilon / 2\), Hoeffding's inequality also shows that the probability that you have to toss the coin more than \(n_i\) times is at most \(\exp(-\epsilon^2 n_i / 2)\). From here, a little calculation shows that the expected number of coin tosses is \[O\left(\frac{\log(1/\delta) + \log \log(1/\epsilon)}{\epsilon^2}\right).\] Note that we pay a factor of \(O(\log \log(1/\epsilon))\) compared to the situation when \(\epsilon\) is known. This is unavoidable, as I'll explain later.

Some concrete numbers

Let's translate back to the language of games and points. The solution described above allows us to calculate a sequence \(V_1, V_2, \dots\) of "victory scores" and a sequence \(S_1, S_2, \dots\) of "survival scores". You get to declare victory if you ever reach some victory score \(V_i\) before your opponent reaches the corresponding survival score \(S_i\). Here are the first few scores for \(\delta = 0.05\).For these concrete numbers, I used \(\epsilon_i = 0.3 \cdot 1.2^{-i}\) instead of \(\epsilon_i = 2^{-i}\). This doesn't change the asymptotics, but it does affect the constants a little.

Victory score Survival score

Following those rules, the game will almost surely end eventually, and there's at least a 95% chance that the better player will win. So if you win under those rules, you can brag that you've mathematically demonstrated your superiority. Except there are several caveats.

  • It still might have just been a fluke. 5% is not really that low.
  • It's tempting to think that you should be 95% certain that you're the superior player, but even that's not quite right. To calculate how certain you should be, you need to take into account your prior beliefs and use Bayes' rule.
  • Independent tosses of a biased coin is not necessarily a good model in the first place. In my experience playing games, it feels like rounds are not actually independent of one another. For example, sometimes you have a burst of renewed focus and you do especially well for many rounds in a row.

My wife and I have been playing a game called "Spot It!" recently. In the simplest version of the game, each "round" is very quick: two cards are revealed, and whoever finds something in common first gets a point. (The cards are based on finite projective planes.) We've been trying to use the above algorithm (with slightly different parameters) to determine which of us is the better player. Unfortunately, it seems that it might take a while to resolve the question. So far, Alicia is winning 1090 - 1027.

Theoretically, you could also use an algorithm like this to run an election between two candidates. In each round, pick a voter at random and give a point to the candidate they voted for. This method might allow you to statistically determine the winner without having to tally all the votes. I don't think it would actually be a good idea though.

Optimality

Like I said earlier, using the simple algorithm I described, the expected number of coin tosses is \[ O\left(\frac{\log(1/\delta) + \log \log(1/\epsilon)}{\epsilon^2}\right). \] In this section, I'll outline a proof that no algorithm can do better, asymptotically. After I worked out the proof, I found this paper by Karp and Kleinberg that includes a proof of the same result.It seems that Karp and Kleinberg weren't too concerned about the small-\(\delta\) regime. Besides the lower bound, they also presented an algorithm for the problem, but their algorithm is non-optimal when \(\delta = o(1)\). (It's not surprising that others have studied this basic statistics problem.)

Let's start by proving that \(\Omega(\log(1/\delta) / \epsilon^2)\) coin tosses are necessary, even if \(\epsilon\) is known in advance. This bound is relatively well-known, but it will be useful to go through a proof, so that we can use the same ideas later to show that the \(\log \log(1/\epsilon)\) term cannot be eliminated.

Let \(X_{\epsilon}\) be the distribution over \(\{0, 1\}\) with \(X_{\epsilon}(1) = \frac{1}{2} + \epsilon\) and \(X_{\epsilon}(0) = \frac{1}{2} - \epsilon\). For the sake of what's coming up later, let's prove a lower bound for the problem of distinguishing \(X_{\epsilon}\) from a fair coin (\(X_0\)), which is essentially equivalent to the original problem of distinguishing \(X_{\epsilon}\) from \(X_{-\epsilon}\). Let \(X_{\epsilon}^n\) denote the joint distribution of \(n\) independent samples from \(X_{\epsilon}\). Let \(f \colon \{0, 1\}^n \to \{0, 1\}\) be any function,I'm bounding the worst-case number of samples here. The argument can be extended to lower-bound the expected number of samples; see Theorem 7.1 in this paper by Dagum et al. the interpretation being that \(f\) is trying to output 1 on \(X_{\epsilon}^n\) and 0 on \(X_0^n\). I'll show that \(f\) has failure probability at least \(\frac{1}{4} \cdot \exp(-8\epsilon^2 n)\).

Let \(\alpha\) denote the failure probability on \(X_{\epsilon}\) and let \(\beta\) denote the failure probability on \(X_0\), i.e., \(\beta = \E[f(X_0^n)]\) and \(\alpha = 1 - \E[f(X_{\epsilon}^n)]\). Let \(D(Y \| Z)\) denote the Kullback-Leibler divergence from \(Z\) to \(Y\). Plugging in the definitions, \(D(f(X_{\epsilon}^n) \| f(X_0^n))\) is given by \[ (1 - \alpha) \ln\left(\frac{1 - \alpha}{\beta}\right) + \alpha \ln \left(\frac{\alpha}{1 - \beta}\right), \] which is at least \((1 - \alpha) \ln(1/\beta) - \ln 2\) because \((1 - \alpha) \ln (1 - \alpha) + \alpha \ln \alpha \geq - \ln 2\). On the other hand, \[ \begin{align*} D(f(X_{\epsilon}^n) \| f(X_0^n)) &\leq D(X_{\epsilon}^n \| X_0^n) \\ &= n \cdot D(X_{\epsilon} \| X_0) \\ &\leq 4 \epsilon^2 n. \end{align*} \] (The first inequality follows from the data processing inequality for KL divergence. The equation in the middle follows from the chain rule for KL divergence. The last inequality is a straightforward computation.) Rearranging, we get \[ \beta \geq \frac{1}{2^{\frac{1}{1 - \alpha}}} \cdot \exp\left(-\frac{4 \epsilon^2 n}{1 - \alpha}\right). \] If \(\alpha > \frac{1}{2}\), we're done, and if \(\alpha \leq \frac{1}{2}\), we get \(\beta \geq \frac{1}{4} \cdot \exp(-8\epsilon^2 n)\) as promised.

Now let's move on to the unknown-\(\epsilon\) case. Suppose you've designed some algorithm such that for every \(\epsilon > 0\) and \(s \in \{\pm 1\}\), given sample access to \(X_{s \cdot \epsilon}\), the algorithm has at least a 99% chanceBy Markov's inequality, a lower bound on the number of coin tosses the algorithm uses with high probability implies a lower bound on the expected number of coin tosses. of halting within \(n(\epsilon)\) samples and outputting \(s\). I claim that there is some universal constant \(\gamma > 0\) and some universal sequence \(\epsilon_1, \epsilon_2, \dots \to 0\) such that for infinitely manyI'm merely claiming a lower bound for arbitrarily small \(\epsilon\), not for all sufficiently small \(\epsilon\). This is similar to the distinction between saying "\(n(\epsilon) = \Omega(\cdots)\)" vs. saying "\(n(\epsilon) \neq o(\cdots)\)". Indeed, \(n(\epsilon)\) is not necessarily \(\Omega\left(\frac{\log \log(1/\epsilon)}{\epsilon^2}\right)\). \(i\), \[ n(\epsilon_i) \geq \frac{\gamma \ln \ln(1/\epsilon_i)}{\epsilon_i^2}. \]

Proof: Let \(\epsilon_i = \exp(-i^2)\) and \(n_i = n(\epsilon_i)\). Assume for a contradiction that for all sufficiently large \(i\), the inequality is violated. Then for all such \(i\), we have \(n_i \leq \gamma / \epsilon_{i + 1}^2\). Therefore, if \(\gamma\) is sufficiently small, by the known-\(\epsilon\) lower bound, \(n_i\) samples are not enough to distinguish \(X_{\epsilon_{i + 1}}\) from \(X_{-\epsilon_{i + 1}}\). It follows that when sampling from either \(X_{\epsilon_{i + 1}}\) or \(X_{-\epsilon_{i + 1}}\), there's at least a \(\frac{1}{5}\) chance that the number of samples your algorithm uses lies somewhere between \(n_i\) and \(n_{i + 1}\). Assume without loss of generality that this happens when sampling from \(X_{\epsilon_{i + 1}}\). Let \(f \colon \{0, 1\}^{n_{i + 1}} \to \{0, 1\}\) be the indicator function for that event, i.e., \(f(x) = 1\) if and only if the number of samples your algorithm uses lies somewhere between \(n_i\) and \(n_{i + 1}\) when the first \(n_{i + 1}\) samples are described by \(x\).

What would happen if we gave your algorithm sample access to a fair coin (\(X_0\))? That's not a situation your algorithm was designed for, but it's a legitimate thought experiment. Just like in the known-\(\epsilon\) lower bound, define \(\beta = \E[f(X_0^{n_{i + 1}})]\) and \(\alpha = 1 - \E[f(X_{\epsilon_{i + 1}}^{n_{i + 1}})]\). We have already argued that \(\alpha \leq \frac{4}{5}\). Therefore, \[ \begin{align*} \beta &\geq \frac{1}{2^{\frac{1}{1 - \alpha}}} \cdot \exp\left(-\frac{4 \epsilon_{i + 1}^2 n_{i + 1}}{1 - \alpha}\right) \\ &\geq \frac{1}{32} \cdot \exp(-20 \epsilon_{i + 1}^2 n_{i + 1}). \end{align*} \]

Now, the events "the number of samples your algorithm uses lies between \(n_i\) and \(n_{i + 1}\)" are mutually exclusive as \(i\) varies, so the probability \(p\) that your algorithm ever halts satisfies \[ \begin{align} p&\geq \frac{1}{32} \cdot \sum_i \exp\left(-20 \epsilon_i^2 \cdot n_i\right) \\ &\geq \frac{1}{32} \cdot \sum_i \ln(1/\epsilon_i)^{-20\gamma} \\ &= \frac{1}{32} \cdot \sum_i \frac{1}{i^{40 \gamma}}. \end{align} \] For \(\gamma \leq \frac{1}{40}\), the sum diverges, a contradiction.

So if you play ping pong using these rules and the games take a long time to finish, don't blame the algorithm. The games will go faster if you get really good at ping pong 🙂

Thanks to Alicia Torres Hoza and Jalex Stark for helpful comments on drafts of this blog post.