## Saturday 30 January 2021

### Lottery tickets, baseball cards, and the coupon collector's problem

A man, a woman, an enby, and 27 elephants walk into a bar. The bartender looks at the group of 30 and asks "What is the probability that 2 or more of you have the same birthday?".

The group proceeds to disregard twins, leap years, and building codes. They talk amongst themselves, leave a little space for the reader to calculate or guess for themselves,

# Hint:

1 - prod( (365:336) / 365)

and respond "70.63%". Is this surprising, given there are only 30 of them and there are 365 days?

That's the Birthday Paradox ( https://en.wikipedia.org/wiki/Birthday_problem ): only 23 people are needed for about a 50/50 chance of a 'double' in birthdays. If you flip the problem upside down and ask "how many people do you need in a group until every birthday belongs to at least one person?", then you have the Coupon Collector's Problem.
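The bartender's hint generalizes to any group size. Here is a minimal sketch, assuming 365 equally likely birthdays; the function name `p_shared_birthday` is made up for illustration.

```r
# Probability that at least two of n people share a birthday,
# assuming 'days' equally likely birthdays (no leap years, no twins).
p_shared_birthday = function(n, days = 365)
{
  1 - prod( (days:(days - n + 1)) / days )
}

p_shared_birthday(30)  # about 0.7063, the answer in the story
p_shared_birthday(23)  # about 0.5073, the first group size past 50/50
```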

Rather than worry about whether elephants are people, let's talk about coupon collecting the way people at Topps and Panini do: with collectible cards and stickers.

If there are 365 different cards in a set, how many randomly selected cards do you need to get one of every card in the set, on average? Assume that every card is equally likely, and that there are thousands of copies of each card in the set.

It takes 1 card to get the first one.

The next card has a 364/365 chance of being a new one. By the geometric distribution, that means it will take 365/364 cards on average to get a new one.

If we have two different cards, the next one has a 363/365 chance of being a new one. Or, 365/363 cards on average to get a new one.

If we have N different cards, the next one has a (365 - N) / 365 chance of being a new one. Or 365 / (365 - N) cards on average to get a new one.

When we have all but one card, this gets to be a real pain. Each card only has a 1/365 chance to be the one we want, so we need 365 cards on average to find it.

And how many do we need from the start? Just sum up the expected number of cards needed for each new one.

(365 / 365) + (365 / 364) + (365/363) + ... + (365/3) + (365/2) + (365/1)

=

1 + 1.003 + 1.006 + ... + 121.7 + 182.5 + 365

=

2364.65 cards, or about 6.5 times the number of different cards in the set.

You can check the answer yourself in R with the code

sum( 365 / (365:1) )
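The sum can also be checked by simulation, drawing each waiting time from the geometric distribution used in the argument above. This is only a sketch: the seed, the 2000 repetitions, and the names `p_new`, `one_collection`, and `sim_cards` are arbitrary choices for illustration.

```r
set.seed(2021)  # arbitrary seed, only for repeatability
setsize = 365

# Chance that the next card is new when 0, 1, ..., setsize - 1 cards are held
p_new = (setsize:1) / setsize

# rgeom() counts failures before the first success, so add 1 for the success
one_collection = function()
{
  sum( rgeom(setsize, prob = p_new) + 1 )
}

sim_cards = replicate(2000, one_collection())

mean(sim_cards)                # simulated average cards for a full set
sum( setsize / (setsize:1) )   # exact expected value, 2364.65
```

The two numbers should be close but not identical, because the simulated mean carries sampling error.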

------------

So that's the basic Coupon Collector's problem, but there are lots of ways you can mess with it.

- How many cards until I get 50% / 75% / 90% of the collection? (and I trade or buy the rest individually)

- What if some cards are rarer?

- What if the cards come in packs that guarantee they don't have doubles?

- What if you want 4 copies of each card?

- Are sports cards a good investment?
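The first variation only needs a partial version of the earlier sum: stop adding terms once the wanted share of the set is reached. A rough sketch, assuming every card is equally likely; `cards_for_fraction` is a hypothetical helper name.

```r
# Expected random cards needed to hold a given fraction of a set,
# assuming every card is equally likely.
cards_for_fraction = function(setsize, fraction)
{
  n_wanted = ceiling(fraction * setsize)   # distinct cards wanted
  sum( setsize / (setsize:(setsize - n_wanted + 1)) )
}

# Expected cards for 50%, 75%, 90%, and 100% of the 365-card birthday set
sapply(c(0.50, 0.75, 0.90, 1), cards_for_fraction, setsize = 365)
```

Under these assumptions, the last 10% of the set accounts for more than half of the expected cards, which is why trading or buying the last few individually is tempting.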

Instead of birthdays, let's use a real card set for the rest of this article: 2020 Topps Baseball Update.

---------------

What if you want 2 or more copies of each card?

With multiple copies, things get much more complex because of the number of steps involved.

With 1 copy, the steps are linear, as in this figure, so we can get the expected number of cards needed to get 1 copy of each card just by adding up the expected number of cards to move through each step.

But when we need even 2 copies of each card in a set, there's more than one path from (0,0), meaning 0 first copies and 0 second copies, to (4,4), meaning 4 first copies and 4 second copies, and computing the expected number of cards to get there is much harder.

We can, however, run a simulator many times to estimate the distribution of cards needed and take expected value, median, 90th percentile, or anything else we need from that.

This simulator code is designed to be illustrative, not efficient. In this example, we are looking for at least 4 copies of each card from a set of 100 cards where every card is equally likely.

setsize = 100

Ncopies_wanted = 4

set.seed(12345) # Remove this line to get a different result when you repeat

cardcount = rep(0, setsize)

while( min(cardcount) < Ncopies_wanted)

{

newcard = sample(1:setsize, 1)

cardcount[newcard] = cardcount[newcard] + 1

}

sum(cardcount)

# 921 cards

In one run of the simulation, it took 921 random cards to get 4 complete sets. How does that compare to getting 1 copy?

sum( setsize / (setsize:1))

# 518.73

921 / 518.73 = 1.775

It took less than twice as long to get four complete sets as it took to get one complete set. Is that lucky? It's a lot less than 4. If we run this many times, we can get a better picture.

cards_to_complete_N_sets = function(setsize, Ncopies_wanted)

{

cardcount = rep(0, setsize)

while( min(cardcount) < Ncopies_wanted)

{

newcard = sample(1:setsize, 1)

cardcount[newcard] = cardcount[newcard] + 1

}

return( sum(cardcount) )

}

Nruns = 1000

cards_needed = rep(NA,Nruns)

for(k in 1:Nruns)

{

cards_needed[k] = cards_to_complete_N_sets(100, 4)

}

mean(cards_needed)

# 1081.9

## Standard error of the mean

sd(cards_needed) / sqrt(length(cards_needed))

hist(cards_needed, n=20, las=1, col="LightBlue")

According to our figure, 921 was lucky but not unusual. It takes about 1082 cards on average to complete 4 sets of 100 cards each. That's still only 2.09 times as many cards as a single set takes.
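To put a number on "lucky but not unusual", the same kind of simulated runs can be summarized with a median and percentiles instead of a mean. The sketch below repeats the simulator so it runs on its own; the seed and the 500 runs are arbitrary choices, so the exact values will differ from the 1000-run results above.

```r
# Same simulator as above, repeated so this block is self-contained
cards_to_complete_N_sets = function(setsize, Ncopies_wanted)
{
  cardcount = rep(0, setsize)
  while( min(cardcount) < Ncopies_wanted)
  {
    newcard = sample(1:setsize, 1)
    cardcount[newcard] = cardcount[newcard] + 1
  }
  return( sum(cardcount) )
}

set.seed(2021)  # arbitrary seed
runs = replicate(500, cards_to_complete_N_sets(100, 4))

median(runs)
quantile(runs, probs = c(0.10, 0.90))  # 10th and 90th percentiles
mean(runs <= 921)                      # share of runs at least as lucky as 921
```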

That the number of cards needed for 4 sets (1082) is less than 4 times the cards needed for 1 set (518.73) is intuitive because you are farther along in your collection before new random cards start getting wasted. In other terms, you can start collecting your second copy of each card before you finish collecting your first copy.

There's a definite savings in finishing a card set by doing multiple sets at once.

How much of a savings? Let's find out. The following code fills a table of mean cards needed by number of copies and set size.

set_sizes = c(50,100,150,200)

copies = c(1,2,3,4,5,6,7,8,9,10)

Nruns = 1000

mean_cards_needed = matrix(NA, nrow=length(copies), ncol=length(set_sizes))

set.seed(12345)

for(i in 1:length(copies))

{

for(j in 1:length(set_sizes))

{

cards_needed = rep(NA,Nruns)

for(k in 1:Nruns)

{

cards_needed[k] = cards_to_complete_N_sets(set_sizes[j], copies[i])

}

mean_cards_needed[i,j] = mean(cards_needed)

print(c(i,j,mean_cards_needed[i,j]))

}

}

round(mean_cards_needed)

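Table: mean cards needed (rounded), by number of copies wanted (rows) and set size (columns)

| Copies wanted | 50 cards | 100 cards | 150 cards | 200 cards |
|---|---|---|---|---|
| 1 | 223 | 512 | 841 | 1173 |
| 2 | 326 | 722 | 1149 | 1606 |
| 3 | 412 | 907 | 1443 | 2006 |
| 4 | 493 | 1077 | 1695 | 2336 |
| 5 | 570 | 1251 | 1947 | 2670 |
| 6 | 645 | 1391 | 2199 | 2984 |
| 7 | 715 | 1533 | 2413 | 3313 |
| 8 | 786 | 1690 | 2641 | 3619 |
| 9 | 852 | 1841 | 2850 | 3917 |
| 10 | 925 | 1972 | 3077 | 4185 |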

diff(mean_cards_needed[,1])

diff(mean_cards_needed[,2])

diff(mean_cards_needed[,3])

diff(mean_cards_needed[,4])
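For readers who don't want to rerun the simulation, here is a sketch that applies the same `diff()` idea to the rounded means from the table. The matrix is typed in by hand from the table, and the names `mean_cards_table` and `extra_cards` are made up for illustration.

```r
# Rounded means copied from the table (rows: 1 to 10 copies wanted)
mean_cards_table = cbind(
  "50"  = c(223, 326, 412, 493, 570, 645, 715, 786, 852, 925),
  "100" = c(512, 722, 907, 1077, 1251, 1391, 1533, 1690, 1841, 1972),
  "150" = c(841, 1149, 1443, 1695, 1947, 2199, 2413, 2641, 2850, 3077),
  "200" = c(1173, 1606, 2006, 2336, 2670, 2984, 3313, 3619, 3917, 4185)
)

# Extra cards needed for each additional copy of the set
extra_cards = apply(mean_cards_table, 2, diff)
extra_cards

# The same increments as a multiple of the set size
sweep(extra_cards, 2, c(50, 100, 150, 200), "/")
```

In these simulated means, every additional set costs far fewer cards than the first one did, but still more than one set's worth of cards.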

------------------------

What if some cards are rarer?

Because each random card is less likely to be a rare card, rare cards take longer to collect and are more likely to be the last card needed.

Let's consider a set of 11 cards with 10 very common (.0999 chance each) and 1 very rare (.001 chance). We can do this by modifying our simulation function to account for uneven chances.

cards_to_complete_N_sets_uneven = function(card_chance, Ncopies_wanted)

{

setsize = length(card_chance)

cardcount = rep(0, setsize)

while( min(cardcount) < Ncopies_wanted)

{

newcard = sample(1:setsize, size=1, prob=card_chance)

cardcount[newcard] = cardcount[newcard] + 1

}

return( sum(cardcount) )

}

Nruns = 1000

cards_needed = rep(NA,Nruns)

card_chances = c(.001, rep(.0999,10))

for(k in 1:Nruns)

{

cards_needed[k] = cards_to_complete_N_sets_uneven(card_chances, 1)

}

mean(cards_needed)

# 1032.449

sum( 11 / (11:1))

# 33.2

If all the cards are equally common, it only takes 33.2 cards to complete the set. But, when one of them is much rarer, the set becomes MUCH more difficult to complete.

If the rare card is the last one, then it would take 1 / .001 or 1000 cards on average to get it and complete the set, so it makes sense that it would take about 1000 more cards on average to complete a set with even a single rare one.
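The simulated mean can be cross-checked against an exact value. One standard approach (not the one used in this post) is inclusion-exclusion over subsets of cards: add 1 / (total chance of the subset) for every odd-sized subset and subtract it for every even-sized one. The sketch below uses a hypothetical helper, `expected_cards_uneven`, and loops over all 2^n - 1 subsets, so it is only practical for small sets like this one.

```r
# Exact expected cards to get 1 copy of every card with uneven chances,
# by inclusion-exclusion over all non-empty subsets of cards.
expected_cards_uneven = function(card_chance)
{
  n = length(card_chance)
  total = 0
  for(mask in 1:(2^n - 1))
  {
    # The binary digits of mask pick which cards are in this subset
    in_subset = (mask %/% 2^(0:(n - 1))) %% 2 == 1
    sign = ifelse(sum(in_subset) %% 2 == 1, 1, -1)
    total = total + sign / sum(card_chance[in_subset])
  }
  return(total)
}

expected_cards_uneven( c(.001, rep(.0999, 10)) )  # one rare card, ten common
expected_cards_uneven( rep(1/11, 11) )            # all equal: matches sum( 11 / (11:1) )
```

If the sketch is right, the exact value for the rare-card set is only slightly above 1000. The simulated 1032.449 is within what 1000 runs can give, since a wait that averages 1000 cards is very variable.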

-----------------------------

What if there are no doubles in a pack of cards?

A box of Topps 2020 Update Baseball cards has 60 cards from their basic set of 300. All 60 of the basic cards are unique. How much does this change the equation? We can modify our simulator to 'sample without replacement' to find out.

packs_to_complete_N_sets = function(setsize, packsize, Ncopies_wanted)

{

cardcount = rep(0, setsize)

while( min(cardcount) < Ncopies_wanted)

{

newcard = sample(1:setsize, size=packsize, replace=FALSE)

cardcount[newcard] = cardcount[newcard] + 1

}

return( sum(cardcount) )

}

Nruns = 1000

packsize = 60

setsize = 300

cards_needed = rep(NA,Nruns)

set.seed(12345)

for(k in 1:Nruns)

{

cards_needed[k] = packs_to_complete_N_sets(setsize, packsize, 1)

}

mean(cards_needed)

# 1736.76

mean(cards_needed) / packsize # packs needed

#  28.946

sum(300 / (300:1)) # one random card at a time, duplicates possible

# 1884.799

There are two offsetting factors: A pack with all unique cards 'wastes' fewer cards on its own, but you need to buy the entire pack. Here with each pack covering 1/5 of the set (60 of 300), it's a net savings in cards.
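The first factor can be sized up without simulation. This sketch compares a no-duplicates pack with 60 independent random cards, assuming all 300 cards are equally likely; `distinct_if_random` is a name made up for illustration.

```r
setsize = 300
packsize = 60

# Expected number of distinct cards among 60 independent random cards:
# any given card is missed by all 60 draws with chance (299/300)^60
distinct_if_random = setsize * (1 - ((setsize - 1) / setsize)^packsize)
distinct_if_random   # about 54.5

# A no-duplicates pack always holds 60 distinct cards
packsize - distinct_if_random   # cards 'wasted' on doubles, on average
```

So even starting from an empty collection, 60 independent random cards lose about 5 or 6 cards to doubles, which the no-duplicates pack avoids.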

------------------

Are sports cards a good investment?

Are sports cards a good investment? That's a matter of opinion. Here's my (very much non-expert) opinion on the matter:

The collector value of individual cards, let alone all of them, is hard to estimate. The value of cards goes up and down based on how famous a player is and how many people are actively collecting cards as a whole. That long-term bet is the reason that people make a big deal about rookie cards of players in their first year: those players are relatively unknown, so there are probably fewer copies of a player's rookie card out there compared to their later cards if they become a superstar.

But, given that some people regularly buy up a store's entire stock of a set to 'flip' the sales later, there's probably some profit in it with some expertise. If you're looking to invest in sports cards because you've read recently that it's a hot investment, remember that lots of other people read those articles too, just like they did in the Beanie Baby crash.

You can't just take those cards and turn them directly into individual resale value. Sports cards aren't cash, so the market has a lot of friction. For the valuable cards, there's a whole rating system, based on the condition of the card, and there's the 'work' of sleeving the cards and not getting fingerprints on them.

If you want to get into cards as a money-making play, look for an edge. See if you live in an area where there aren't a lot of 'flippers' who buy up all the retail boxes. Look into leagues that you think could get more attention later, like the WNBA (basketball), KHL (hockey), or NPB and KBO (baseball). This could be your chance to make a long-term bet on the league itself.

In a 67-card box (60 from the basic set plus 7 others) of Topps Update 2020, which costs about \$20 Canadian, the chance of getting something special (e.g. a card with an alternate colour or photo, a signed card, a limited run card, or a piece of memorabilia) is nearly 100%, with the less special things being far more likely.

Even when you don't get anything worth a lot of money, you still get 67 cards, and that's entertainment value. If sleeving and categorizing cards isn't work to you, then you're not making an investment so much as paying for entertainment with a small chance at a big payoff.

Which brings us to one last point of reference: scratch-and-win lottery tickets.

The scratch-and-win ticket "Crossword Extreme IV" costs \$20 Canadian as well and has a 40% chance of paying out anything, with the chances of a prize dropping greatly as prizes increase. (Details here: https://lotto.bclc.com/scratch-and-win/tickets/ticket-management/crossword-extreme-iv-310103.html )

As an investment, it's terrible, but at least it's less work to collect on the payout. As an entertainment product, personally I think of scratch-and-wins as card packs with a much shorter entertainment time, and no consolation prize of pictures of sports players if you don't win.

I chose this scratch-and-win game in particular because it takes a lot of time to play compared to most scratch-and-wins, so you're either getting the most entertainment for your money, or doing the most work for your investment, depending on your perspective.

--------------------------

Q1) If it costs \$0.20 to get a random card, and \$1 to buy specific individual cards, at what point is it cheaper to start buying individual cards than random cards? (Set of 50 cards, 1 set each, all equal chance)

Q2) What distribution does the number of copies of each card follow after a fixed number N of random cards are drawn? Assume a set of size S all with equal chance 1/S. (Answer: Binomial. p = 1/S, n=N)

Q3) As the number of random cards gets large, what does this approximate? (Answer: Poisson, then eventually Normal)

Q4) Every additional set takes fewer additional cards to complete, but there is a natural minimum to the expected number of cards to complete an additional set. What is it and why? Assume every card is equally likely. (Answer: setsize, that is 1 / (1/setsize), because that's the expected number of cards to get the last card in the set, and we cannot start an additional set with fewer than 1 card remaining)

Q5) How many cards would it take to complete a set with many rares and one common card? Try this with 10 cards of .001 chance each, and 1 card of .99 chance. Is this what you would expect?

Q6) How many cards would it take to complete a set with a gradient of rarities? Try this with 10 cards of probability 1/55, 2/55, 3/55, ... , 10/55.

Q7) For 300 equally likely cards, what is the pack size that would require you to buy the fewest total cards to complete one set? Round to the nearest pack size of 10.

Q8) Panini sells packs of stickers with 2 stickers per pack with no duplicates. In the 2018 World Cup set, there were 681 stickers in total. Did selling stickers in packs of 2 reduce the total number of stickers needed compared to buying them randomly 1 at a time?

Q9) The 2008 Magic: The Gathering set "Shards of Alara" has 229 cards: 101 are 'common', 60 are 'uncommon', 53 are 'rare', and 15 are 'mythic rare'.

In each pack of cards, there are 10 commons, 4 uncommons, and 1 rare. In 1/8 of all packs, there is also a mythic rare. Assume there are no duplicates in a pack. Assume that within each rarity, each card is equally likely.

Q9a) How many packs of cards does it take to complete a set, on average?

Q9b) How many packs of cards does it take to get 4 copies of each card?

Q9c) How many packs of cards does it take to get 4 copies of each card, excluding mythic rares?

Want more related content? This is part of a coursepack on Statistics and Gambling