Coupling and the Coupon Collector


In the fall semester last year, I took Stat 110, an introductory statistics course focusing on probability. I had seen probability in high school contest math, but this course was my first rigorous treatment of the subject.

One of the most interesting problems I saw in the course involves Markov chains, and it has a simple, elegant solution that uses another interesting problem from earlier in the course: the coupon collector’s problem.

Random-to-Top Shuffling Problem

Suppose $n$ cards are placed in order on a table. Consider the following shuffling procedure: Pick a card at random from the deck and place it on top of the deck. What is the expected number of times we need to repeat the process to arrive at a “random” deck, for some suitable definition of “random”?

To solve this question, we’ll need to answer a seemingly unrelated question first.

Coupon Collector’s Problem (a.k.a. the Toy Collector’s Problem)

A certain brand of cereal includes a toy in every cereal box, and the toy in each box is chosen uniformly at random from a set of $n$ distinct toys. A toy collector wishes to collect all $n$ distinct toys. What is the expected number of cereal boxes the collector must buy to collect all of them?

Solution to the Toy Collector’s Problem

The key to understanding this problem is to break the task of collecting all $n$ distinct toys into different stages: what is the expected number of cereal boxes that the toy collector has to buy to get the $i$-th new toy?

Let random variable $X_i$ be the number of boxes it takes for the toy collector to collect the $i$-th new toy after the $(i-1)$-th new toy has already been collected. (Note: this does NOT mean we assign numbers to the toys and then wait for toy number $i$. Instead, it means that after $X_i$ more boxes, the toy collector will have collected $i$ distinct toys, whereas with only $X_i - 1$ boxes, the toy collector would still have only $i-1$ distinct toys.)

Clearly $E(X_1) = 1$, because the toy collector starts off with no toys. Now consider the $i$-th new toy. After the $(i-1)$-th new toy has been collected, there are $n - (i-1)$ possible toys that could be the new $i$-th toy. We can interpret the process of waiting for the $i$-th new toy as a geometric distribution, where each trial is buying another cereal box, “success” is getting any of the $n - (i-1)$ uncollected toys, and “failure” is getting any of the already collected $i - 1$ toys. From this point of view, we see that $$X_i - 1 \sim \textrm{Geom}(p)$$ where the probability of success $p$ is $$p = \frac{n - (i-1)}{n}.$$

Here, our definition of the geometric distribution does NOT include the success. Using the expectation of a geometric distribution, we have that the expected number of cereal boxes the toy collector must buy to get the $i$-th new toy after collecting $i-1$ toys is $$E(X_i - 1) = \frac{1 - p}{p}$$ $$E(X_i) = \frac{1}{p} = \frac{n}{n - (i - 1)}.$$

Now let random variable $X$ be the number of cereal boxes the toy collector needs to buy to collect all $n$ distinct toys. Since we have separated the process into stages of collecting each new toy, we have $$X = X_1 + X_2 + \cdots + X_n.$$

Using linearity of expectation, we can compute the expected value of $X$ by summing the individual expectations of the $X_i$. Thus, we obtain the following result: $$E(X) = E(X_1 + X_2 + \cdots + X_n)$$ $$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n)$$ $$E(X) = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1} = n\left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \right).$$

The sum in parentheses is the $n$-th partial sum of the harmonic series! The harmonic series diverges to infinity, and its partial sums grow approximately as $\gamma + \log n$, where $\gamma \approx 0.57722$ is Euler’s constant. Thus, we can approximate the expected number of cereal boxes with: $$E(X) \approx n (\gamma + \log n).$$
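
As a sanity check, here is a minimal simulation sketch in C++ (the file and function names are my own, just for illustration) that estimates $E(X)$ empirically and compares it against the approximation $n(\gamma + \log n)$:

Coupon Collector Simulation (sketch) - coupon.cpp
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Simulate one run of the coupon collector's problem: buy boxes
// until all n distinct toys have been collected, and return the
// number of boxes bought.
int collect_all(int n, std::mt19937 &rng) {
  std::uniform_int_distribution<int> toy(0, n - 1);
  std::vector<bool> seen(n, false);
  int distinct = 0, boxes = 0;
  while (distinct < n) {
    ++boxes;
    int t = toy(rng);
    if (!seen[t]) {
      seen[t] = true;
      ++distinct;
    }
  }
  return boxes;
}

int main(void) {
  const int n = 52, trials = 10000;
  const double gamma = 0.57722; // Euler's constant
  std::mt19937 rng(42);
  double total = 0;
  for (int i = 0; i < trials; ++i)
    total += collect_all(n, rng);
  std::cout << "empirical E(X):   " << total / trials << std::endl;
  std::cout << "n(gamma + log n): " << n * (gamma + std::log(n)) << std::endl;
  return 0;
}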

Solution to the Random-to-Top Shuffling Problem

Markov Chains and Stationary Distributions

Coming back to the random-to-top shuffling problem, we first need to define our notion of “random” for our deck. In order to do this, we use Markov chains.

For our Markov chain, let the states be all $n!$ permutations of an $n$-card deck, with a transition from one state to another whenever a single step of the shuffle can turn the first permutation into the second. From any state, we move to each of $n-1$ other states with probability $\frac{1}{n}$, or stay at the same state (by picking the top card) with probability $\frac{1}{n}$. By symmetry, no permutation is favored over any other, so the stationary distribution for our random-to-top shuffling Markov process is the uniform vector $$\vec{s} = \left(\frac{1}{n!}, \ldots, \frac{1}{n!}\right).$$
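
To verify this claim, note that each state also has exactly $n$ incoming transitions, one for each position the moved card could have come from (counting the self-loop), each with probability $\frac{1}{n}$. Hence every column of the transition matrix $Q$ sums to 1, i.e. $Q$ is doubly stochastic, and the uniform vector is indeed stationary: $$(\vec{s} Q)_j = \sum_{i} s_i Q_{ij} = \frac{1}{n!} \sum_{i} Q_{ij} = \frac{1}{n!} = s_j.$$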

Thus, to define our notion of a “random” deck, we would like the deck that results from running our shuffling procedure to be sampled from the stationary distribution: that is, the resulting deck should be equally likely to be any of the $n!$ permutations.

Coupling

Now that we have established that our shuffling process can be modeled with a Markov chain that has a stationary distribution, we use the idea of “coupling” to arrive at our solution.

Let deck $A$ be our original deck, and let deck $B$ be uniformly randomly sampled from all $n!$ permutations. Since the stationary distribution for our shuffling process is the uniform distribution, then deck $B$ is sampled from the stationary distribution.

We use the fact that if we start our Markov process from a state sampled from the stationary distribution, then the resulting state will also be sampled from the stationary distribution. More formally:

Lemma. Let $\vec{s}$ be the stationary distribution of our Markov chain. Let $X_0$ be our starting state, and let it be sampled from the stationary distribution (i.e. $P(X_0 = i) = s_i$). Then the resulting state $X_1$ after running the Markov chain for one step will also be sampled from $\vec{s}$.

Now consider our “coupling” strategy: every time we move a card $C$ to the top of deck $A$, we locate card $C$ in deck $B$ and place it on top of that deck. Note that the physical process of choosing card $C$ is different for the two decks: we chose a random position in deck $A$, whereas we searched for card $C$ in deck $B$. But even though the selection processes differ, from deck $B$’s perspective, $C$ is simply a card selected uniformly at random. Using our lemma, deck $B$ remains sampled from the stationary distribution after moving card $C$ to the top of deck $B$.

We note that after $t$ steps, all the cards that have been touched up to time $t$ will be in the same order on top of both decks. So once all the cards of deck $A$ and deck $B$ are in the same order, say after $T$ steps, both decks are sampled from the stationary distribution (because deck $B$ always stays stationary through our coupling strategy). Thus, after $T$ steps, deck $A$ will satisfy our notion of a “random” deck. We wish to compute $E(T)$.

How do we compute $E(T)$? We note that both decks will be identical once we have touched all the cards. Therefore, we wish to compute the expected number of random-to-top shuffles needed to touch all the cards. This is an instance of the coupon collector’s problem! Touching all $n$ cards corresponds to collecting all $n$ coupons. Thus, after approximately $n (\gamma + \log n)$ random-to-top shuffles, our original deck $A$ will be a “random” deck. For $n = 52$, we require $E(T) \approx 236$ shuffles to randomize our deck.
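
As with the coupon collector’s problem, we can check this result empirically. Below is a minimal simulation sketch in C++ (again, the file and function names are my own, just for illustration) that performs actual random-to-top shuffles and counts the steps until every card has been touched:

Random-to-Top Simulation (sketch) - shuffle.cpp
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Perform random-to-top shuffles on an actual deck, counting the
// steps until every card has been touched at least once (this is
// the stopping time T from the coupling argument).
int randomize_deck(int n, std::mt19937 &rng) {
  std::vector<int> deck(n);
  std::iota(deck.begin(), deck.end(), 0); // deck[0] is the top card
  std::vector<bool> touched(n, false);
  std::uniform_int_distribution<int> pick(0, n - 1);
  int remaining = n, steps = 0;
  while (remaining > 0) {
    ++steps;
    int i = pick(rng);                // choose a uniformly random position
    int card = deck[i];
    deck.erase(deck.begin() + i);     // remove that card...
    deck.insert(deck.begin(), card);  // ...and place it on top
    if (!touched[card]) {
      touched[card] = true;
      --remaining;
    }
  }
  return steps;
}

int main(void) {
  const int n = 52, trials = 5000;
  std::mt19937 rng(7);
  double total = 0;
  for (int t = 0; t < trials; ++t)
    total += randomize_deck(n, rng);
  std::cout << "average shuffles for n = 52: " << total / trials
            << std::endl; // approximately 236
  return 0;
}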

A Twist on Binary Search


This past semester, I took a graduate course, CS 207 - Systems Development in Computational Science. In the course, we talked about good software engineering practices in C++ (but the lessons span beyond C++), in particular representation invariants, abstraction functions, and writing solid code specifications so that one could even prove things about code. The professor made a couple of blog entries for some of the lectures, explaining cool tricks with iterators and bits.

Early in the semester, we discussed several implementations of binary search, starting from a simplistic version and incrementally building up to a production-ready version. I thought the binary search discussion was an extremely eye-opening exercise; it was my first time seeing invariants being used in proofs to prove properties about code.

Below is how I’ve written binary search since high school:

Binary Search, Attempt #1 - search.cpp
/** Returns the index of any occurrence of @a x in @a a
 * @pre @a a has length equal to @a n
 * @pre @a a is sorted in increasing order
 * @return -1 if not found 
 */
int binary_search(int *a, int n, int x) {
  int lo = 0;
  int hi = n - 1;
  while (lo <= hi) {
    int mid = (lo + hi) / 2;
    if (x == a[mid])
      return mid;
    if (x < a[mid])
      hi = mid - 1;
    else
      lo = mid + 1;
  }
  return -1;
}

Here, I am using Doxygen-style comments for my specifications. In this version of binary search, I return the index of any occurrence of item x in array a, or -1 if there is no such occurrence. While this implementation is acceptable for an array of ints, it is not particularly useful for other data types.

Using C++ templates, we can generalize this implementation to make it polymorphic for any type T, provided we supply a suitable comparison function compare, where compare(p,q) returns true if and only if p is less than q under some ordering of values of type T. Thus, here is our attempt #2 at binary search:

Binary Search, Attempt #2 - search.cpp
/** Returns the index of any occurrence of @a x in @a a
 * @param compare(p,q) returns true if p < q
 * @pre @a a has length equal to @a n
 * @pre @a a is sorted by @a compare
 * @return -1 if not found 
 */
template<typename T, typename CMP>
int binary_search2(T *a, int n, T x, CMP compare) {
  int l = 0;
  int r = n;
  while (l < r) {
    int m = l + (r - l) / 2; // fix overflow issues
    if (compare(a[m],x))
      l = m + 1;
    else if (compare(x,a[m]))
      r = m;
    else
      return m; // here a[m] == x, so m is an occurrence of x
  }
  return -1;
}

Now, in order to call binary search, we must provide a function object compare that defines how we compare two elements of type T. Below is an example of how we would invoke this version of binary search:

Calling Binary Search, Attempt #2 - search.cpp
#include <iostream>

struct IntComp {
  bool operator()(int x, int y) {
    return x < y;
  }
};

int main(void) {
  int arr[12] = {2,3,4,5,7,8,9,11,13,15,16,17};
  std::cout << binary_search2(arr, 12, 15, IntComp()) << std::endl;
  return 0;
}

We overload operator() to allow IntComp objects to be invoked like functions, and we pass an instance of IntComp to binary_search2 whenever we perform a binary search on an array of ints.

Note one other difference between the two versions of binary search: in attempt #1, we had the line:

int mid = (lo + hi) / 2;

whereas in attempt #2, we replaced this line with:

int m = l + (r - l) / 2;

For all these years, I’ve been writing binary search incorrectly! In the first version, we may run into integer overflow if lo + hi happens to be greater than the maximum value an int can hold! In the second version, we fix this subtle bug by first computing the difference r - l, then adding half of that difference to l to get the new middle index m. By subtracting first, we are guaranteed that r - l will not overflow (by the implicit precondition that l and r are valid bounds into the array with l < r inside the loop), and thus m will also be a valid index into the array.
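
To make the bug concrete, here is a small sketch (assuming a 32-bit int, whose maximum value is 2147483647). With bounds this large, computing lo + hi would overflow, and signed overflow is undefined behavior in C++; the subtract-first form stays in range:

Midpoint Overflow (sketch) - overflow.cpp
#include <iostream>

int main(void) {
  // Hypothetical bounds into an enormous array (assuming 32-bit int).
  int lo = 2000000000;
  int hi = 2100000000;
  // (lo + hi) / 2 would first compute lo + hi = 4100000000, which
  // exceeds the maximum int value 2147483647: undefined behavior.
  // The subtract-first form never leaves the valid range:
  int mid = lo + (hi - lo) / 2;
  std::cout << mid << std::endl; // 2050000000
  return 0;
}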

We have generalized our binary search to work on an array containing any type. But, we have actually done more than this. In C++, iterators overload pointer syntax to represent collections of items. Using iterators, we can represent an entire range of items in a collection with only two iterators–one pointing to the beginning of the collection, and one pointing to the “position” after the last element in the collection. See the CS 207 blog entries here for more information on C++ iterators. In our example, however, we represent the array collection with a pointer to the first position and the number of items in the list. Because binary search requires random access into our collection, any collection represented by a random access iterator will be able to use the second version of our binary search!
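
To illustrate, here is a sketch of attempt #2 rephrased in terms of random access iterators (an illustration only; the C++ standard library already ships this idea as std::binary_search and std::lower_bound in <algorithm>):

Binary Search with Iterators (sketch) - search_iter.cpp
// [first, last) is the sorted range; return an iterator to some
// occurrence of x, or last if x is not present.
template<typename RandomIt, typename T, typename CMP>
RandomIt binary_search_iter(RandomIt first, RandomIt last,
                            const T& x, CMP compare) {
  RandomIt not_found = last;
  while (first < last) {
    RandomIt mid = first + (last - first) / 2;
    if (compare(*mid, x))
      first = mid + 1;
    else if (compare(x, *mid))
      last = mid;
    else
      return mid; // *mid == x
  }
  return not_found;
}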

Can we still do better? In our specification for binary search, note that we allowed the index of any occurrence of our search item x to be returned. This ambiguity makes it difficult to make any real use of the return value of binary search (except simply to check whether the item is in the collection). Instead of returning any index, what if we returned a lower bound position of the element x in our collection? By lower bound, we mean the first index into the array at which we should insert x and still keep the elements in sorted order.

For example, with the array {0, 1, 2, 5, 5, 5, 7, 9}, the lower bound of 0 would be 0, because we can insert 0 at index 0 and still keep our array sorted. The lower bound of -1 is also 0 by similar reasoning. The lower bound of 5 is 3 because 3 is the smallest index at which we can insert 5 and keep the array sorted. Similarly, the lower bound of 6 is 6. Note that the lower bound of 10 is 8, which is not a valid index into the array. This is okay because the return value only indicates the index at which one could insert an item and maintain the sorted property of the array.

To implement this, we can think of the array as a collection of boolean values {false, false, ..., false, true, true, ..., true}, where all the falses occur together at the beginning of the array. Each boolean value corresponds to whether our target element x is less than or equal to the value in that array position. Our goal, then, is to find the index of the first true in the array, or to return n (one past the last index) if every entry is false, indicating that placing x at the end of the array would maintain the sorted property. Building on the polymorphism we introduced in attempt #2, here is attempt #3 using the lower bound idea:

Binary Search, Attempt #3 - search.cpp
/** Return the lower-bound position of @a x in @a a
 * @param compare(p,q) returns true if p < q
 * @pre @a a has length equal to @a n
 * @pre @a a is sorted by @a compare
 * @post return R where 0 <= R <= @a n and:
 *   For all 0 <= i < n, 
 *      i < R iff a[i] < x
 *      i >= R iff a[i] >= x 
 */
template<typename T, typename CMP>
int lower_bound(T *a, int n, T x, CMP compare) {
  int l = 0;
  int r = n;
  while (l < r) {
    int m = l + (r - l) / 2;
    if (compare(a[m],x))
      l = m + 1;
    else
      r = m;
  }
  return l;
}

Nice, clean, and simple!

Note that this version uses only one comparison per iteration instead of two (as we did in attempts #1 and #2)! This lower bound idea not only tells us whether our element x is in the array, but also where we should place it to keep the list sorted!
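
For instance, here is a small driver (reusing IntComp from attempt #2 and lower_bound from attempt #3) that reproduces the lower bound examples from above:

Calling Lower Bound - search.cpp
#include <iostream>

int main(void) {
  int arr[8] = {0, 1, 2, 5, 5, 5, 7, 9};
  std::cout << lower_bound(arr, 8, 5, IntComp()) << std::endl;  // 3
  std::cout << lower_bound(arr, 8, 6, IntComp()) << std::endl;  // 6
  std::cout << lower_bound(arr, 8, 10, IntComp()) << std::endl; // 8 (== n)
  return 0;
}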

This code looks simple enough to verify the correctness by eyeballing it; but can we make this rigorous? Can we prove the correctness of this code? Yes! Here is the same piece of code but commented heavily with the proof of its own correctness.

Binary Search Lower Bound Proof - search.cpp
/** Return the lower-bound position of @a x in @a a
 * @param compare(p,q) returns true if p < q
 * @pre @a a has length equal to @a n
 * @pre @a a is sorted by @a compare
 * @post return R where 0 <= R <= @a n and:
 *   For all 0 <= i < n, 
 *      i < R iff a[i] < x
 *      i >= R iff a[i] >= x 
 */
template<typename T, typename CMP>
int lower_bound_proof(T *a, int n, T x, CMP compare) {
  // pre: for all i,j with 0 <= i <= j < n, we have a[i] <= a[j]
  // post: let R be the return value. Then 0 <= R <= n, and
  //   for all 0 <= i < n,
  //     i < R iff a[i] < x    (1)
  //     i >= R iff a[i] >= x  (2) 

  int l = 0;
  int r = n;
  while (l < r) {
    // PRE LOOP
    // loop invariant: l <= R <= r (always true in the loop)
    // decrementing function: d = r - l

    int m = l + (r - l) / 2; // if r - l >= 2, then (r - l)/2 >= 1,
                             //                so l < m < r
                             // if r - l == 1, then l = m < r, 
                             //                so l <= m < r

    if (compare(a[m],x)) {
      // we have a[m] < x. Then by (1), a[m] < x ==> m < R
      // then for all 0 <= i <= m, a[i] < x (b/c sorted)
      l = m + 1; // so l < l_new == m + 1 <= R
                 // r_new == r >= R
                 // so l_new <= R <= r_new
                 // and r_new - l_new < r - l (d decrements)
    } else {
      // we have a[m] >= x. Then by (2), a[m] >= x ==> m >= R
      // then for all m <= i < n, a[i] >= x (b/c sorted)
      r = m; // so r > r_new == m >= R
             // l_new == l <= R
             // so l_new <= R <= r_new
             // and r_new - l_new < r - l (d decrements)
    }

    // POST LOOP
    // loop invariant: l_new <= R <= r_new
    // decrementing function: r_new - l_new < r - l
  }

  // by the decrementing function, d eventually reaches 0;
  //      thus the loop terminates
  // by the loop invariant, we have l <= R <= r;
  //      at loop exit, l >= r, so l == R == r
  return l;
}

To prove the correctness, we make heavy use of the post condition:

/**
 * @post return R where 0 <= R <= @a n and:
 *   For all 0 <= i < n, 
 *      i < R iff a[i] < x
 *      i >= R iff a[i] >= x 
 */

Thus, all elements at indices less than the return value R are less than x, and all other elements are greater than or equal to x. We use both of these if-and-only-if conditions in the two branches of the if conditional to guide how we should update l or r.

In both branches of the conditional, the new values of l and r are maintained so that l <= R <= r still holds and the postcondition is still satisfiable. Thus, the statement l <= R <= r is a loop invariant of the while loop: it is true on entering and leaving the loop body. To ensure that the loop terminates, we need a decrementing function: a quantity that strictly decreases on each iteration of the loop and equals zero when the loop terminates. In this case, the obvious choice is d = r - l. We show in both branches that the new values of l and r satisfy r_new - l_new < r - l, so d decreases on each iteration. When d = 0, we have l = r, which is indeed when the loop terminates. Thus, our final line return l; is proven correct by the combination of our postconditions, preconditions (the array is sorted), loop invariant, and decrementing function. By analyzing the invariants, the code almost writes itself! Cool!
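
As an extra check on top of the proof, here is a sketch of a brute-force test (the helper lower_bound_naive is my own, just for illustration; it reuses IntComp from attempt #2 and lower_bound from attempt #3) that compares lower_bound against a linear scan on small random sorted arrays:

Lower Bound Property Test (sketch) - search_test.cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Brute-force reference: R is the number of elements strictly less
// than x, which is exactly the postcondition's characterization.
int lower_bound_naive(const int *a, int n, int x) {
  int r = 0;
  while (r < n && a[r] < x)
    ++r;
  return r;
}

int main(void) {
  std::mt19937 rng(0);
  std::uniform_int_distribution<int> val(0, 20);
  for (int trial = 0; trial < 10000; ++trial) {
    int n = trial % 16;
    std::vector<int> a(n);
    for (int i = 0; i < n; ++i)
      a[i] = val(rng);
    std::sort(a.begin(), a.end());
    int x = val(rng);
    assert(lower_bound(a.data(), n, x, IntComp())
           == lower_bound_naive(a.data(), n, x));
  }
  return 0;
}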

To view the code in its entirety (along with a couple of simple test harnesses for each version of binary search), check out the source here.

New Layout, LaTeX, and Tag Clouds!


I got a new octopress layout using Melandri’s layout here. I also discovered how to make cool striped backgrounds with StripeGenerator, use cool new fonts from The League of Movable Type, and use cool pre-made icons with Double-J Design. Hopefully the layout will encourage me to actually keep up with my tech blog!

I also installed $\LaTeX$ integration with Octopress using the handy hints from here. Now I can write pretty in-line equations like $e^{i\pi} + 1 = 0$ or centered equations like $$\int_{\Omega} \, d\omega = \int_{\partial \Omega} \, \omega.$$ Nice! Hopefully this will motivate me to write more math-related entries!

To keep track of tags, I generate tag clouds (see the right sidebar) using this plugin here. I also finally discovered how to make background images that are just noise using this background generator here. I like the simplicity of these backgrounds!

GraphLint: Creating a Domain Specific Language for Graph Validation


This is a bit late, but I had written a blog post for the project I worked on during my winter internship at Knewton. My article was published on their tech blog here. I created a domain specific language to verify certain predicates on directed property graphs (e.g. all nodes of type A are connected to exactly 2 nodes of type B via edges of type E). This was a great exercise in the material I learned from the compilers course I took in the fall. Designing the language, the validation engine, and engineering the whole project was an awesome experience!

First Post!


This will be my tech blog where I post my thoughts (and hopefully some code) about computer science, math, and the like. I’m using Octopress for blog generation, and I’m hosting it on my Github page. Nice and simple. Yessss.