Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

Monday 1 April 2019

Bingo analysis, a tutorial in R



I'm toying with the idea of writing a book about statistical analyses of classic games. The target audience would be mathematically interested laypeople, much like Jeffrey Rosenthal's book Struck by Lightning ( https://www.amazon.ca/Struck-Lightning-Jeffrey-S-Rosenthal/dp/0006394957 ).

The twist would be that chapter would contain step-by-step R code or Python code so that the reader could do the same analysis and make changes based on their own questions. Material would like this post on Bingo, as well as my previous post on Snakes and Ladders ( https://www.stats-et-al.com/2017/11/snakes-and-ladders-and-transition.html ).

There would also be some work on chess variants, othello, poker, and possibly go, mahjong, and pente. Tied to each analysis could be light lessons on statistics. This Bingo analysis involves Monte Carlo style simulation, as well as notes on computing expected values, CDFs and PDFs.


### Bingo is all about lines, so first let's make a function that checks for the lines on a bingo card.
### There are 12 lines in a standard 5x5 bingo card: 5 rows, 5 columns, and 2 diagonals.



### Check line function
checklines = function(bingo_vec) ### Assume the input comes as a vector...
{
    ### Then convert it to a 5x5 matrix.
    ### If it's already a 5x5 matrix when input, this shouldn't change anything.
    bingo_mat = matrix(bingo_vec,nrow=5,ncol=5)

    linesgot = rep(FALSE,12)
   
    ### For each row, check if all five dabs have been gotten.
    ### If even one in the row is missed, return FALSE for that line.
    for(row in 1:5)
    {
        linesgot[row] = all(bingo_mat[row,])
    }
   
    ### For each column, check if all five dabs have been gotten.
    for(col in 1:5)
    {
        linesgot[col + 5] = all(bingo_mat[,col])
    }
   
    ### Check the Top-Left to Bottom-Right diagonal
    linesgot[11] = all(
    c(bingo_mat[1,1],bingo_mat[2,2],

bingo_mat[3,3],bingo_mat[4,4],bingo_mat[5,5]))

    ### Check the Top-Right to Bottom-Left diagonal
    linesgot[12] = all(
    c(bingo_mat[1,5],bingo_mat[2,4],bingo_mat[3,3],


bingo_mat[4,2],bingo_mat[5,1]))

    ### Set the 12 TRUE/FALSE answers for the 12 lines as the function's output.
    return(linesgot)
}


### Try the 'checklines' function on a sample bingo card

card01 = matrix(c(
0,0,1,0,0,
1,1,1,1,1,
0,0,1,0,0,
0,0,1,0,0,
0,0,1,0,0), byrow=TRUE, nrow=5, ncol=5)

cardTF = (card01 == 1)

card01   ## FALSE TRUE FALSE .... FALSE TRUE FALSE
cardTF   ## 2 8

## 2 and 8 should be true for the 2nd row and 3rd column
checklines(cardTF) 
which(checklines(cardTF))



####### Now let's scale this up!
### Repeat the following 10000 times
### Put the numbers from 1 to 25 in a random order
### 'dab' the spaces on a bingo card in that random order
### After each dab, check which lines on the bingo card are filled
### Record the number of lines in lines_so_far

Nruns = 10000
lines_so_far = matrix(NA,nrow=Nruns,ncol=25)


for(run in 1:Nruns)
{
    bingo_vec = rep(FALSE,25)
    dab_order = sample(1:25)

    for(k in 1:25)
    {
        bingo_vec[dab_order[k]] = TRUE
        lines_so_far[run,k] = sum(1*(checklines(bingo_vec = bingo_vec)))
    }
}


### lines_so_far is a 10000x25 matrix
### The value in the Rth row and the Cth column is the...
### ... number of lines after C dabs on the Rth bingo card.

### Here is how this happened for the first 6 bingo cards
head(lines_so_far)

########## Now let's check our work.
### Check 1: Are all 12 lines included after 25 dabs? Every time?
table(lines_so_far[,25])

### Check your work. Are there 0 lines in the first 4 dabs? Every time?
table(lines_so_far[,1])
table(lines_so_far[,2])
table(lines_so_far[,3])
table(lines_so_far[,4])


### Check your work. Are the number of lines always the same or decreasing?
### If correct, all answers to this code should be false
for(k in 5:24)
{
    result = any(lines_so_far[,k] > lines_so_far[,k+1])
    print(result)
}



#### Since everything seems to be working, let's do some analysis.
### Start with something simple
### Probability of getting first line by nth dab.
lines_01 = pmin(lines_so_far,1)
CDF = apply(lines_01,2,mean)
PDF = diff(CDF)  ## The pdf is the derivative of cdf

## Mean average dab for first line
sum(1 - CDF)  ### Answer:  13.8969 dabs to first line.

## Median average dab
min(which(CDF >= 0.5)) ### Answer: 15 dabs

## Standard deviation
EX = sum(2:25 * PDF)
EX2 = sum( (2:25)^2 * PDF)

sqrt(EX2 - EX^2)  ### 2.55 dabs


### Histogram of time of first line
barplot(PDF, names.arg=2:25, xlab="Dabs", ylab="Probability", las=1, main="Dab of first line, standard bingo")
abline(h=(1:10)*0.02)






###### Analysis 2:  What if you need two lines?
lines_02 = matrix(0,nrow=nrow(lines_so_far),ncol=ncol(lines_so_far))
lines_02[which(lines_so_far >= 2, arr.ind=TRUE)] = 1

##### Repeat the analysis
CDF_2 = apply(lines_02,2,mean)
PDF_2 = diff(CDF_2)

### Mean, median, sd
sum(1 - CDF_2)   ## 16.59 dabs
min(which(CDF_2 >= 0.5)) ## 18

EX = sum(2:25 * PDF_2)
EX2 = sum( (2:25)^2 * PDF_2)
sqrt(EX2 - EX^2) ## 1.73


### Histogram of time of second line
barplot(PDF_2, names.arg=2:25, xlab="Dabs", ylab="Probability", las=1, main="Dab of second line, standard bingo")
abline(h=(1:10)*0.02)







### Analysis 3: What if you wanted to know about two specific lines, like the diagonals?
### How many dabs until you could expect an X?

### Let's go back to the output of checklines. lines_so_far gives total number of lines, but not which lines.
### The diagonals are specifically the 11th and 12th lines being checked, so we can modify our Monte Carlo simulator to get those.

### Only two changes were made:
### 1. replace the name lines_so_far with diags_so_far
### 2. Only look at lines 11 and 12 when summing up the number of lines. [11:12]

Nruns = 10000
diags_so_far = matrix(NA,nrow=Nruns,ncol=25)


for(run in 1:Nruns)
{
    bingo_vec = rep(FALSE,25)
    dab_order = sample(1:25)

    for(k in 1:25)
    {
        bingo_vec[dab_order[k]] = TRUE
        diags_so_far[run,k] = sum(1*(checklines(bingo_vec = bingo_vec)[11:12]))
    }
}


### The results look like the lines_so_far results, but only two of the lines count.
head(diags_so_far)


### Now repeat the analysis for 2 lines,
### where the only 2 lines being counted are diagonal

diags_both = matrix(0,nrow=nrow(diags_so_far),ncol=ncol(diags_so_far))
diags_both[which(diags_so_far >= 2, arr.ind=TRUE)] = 1

##### Repeat the analysis
CDF_D = apply(diags_both,2,mean)
PDF_D = diff(CDF_D)

### Mean, median, sd
sum(1 - CDF_D) 
## 22.40 dabs to get an X
min(which(CDF_D >= 0.5)) ## 24 for a 50% chance, wow

EX = sum(2:25 * PDF_2)
EX2 = sum( (2:25)^2 * PDF_2)
sqrt(EX2 - EX^2) ## SD: 1.72

### Histogram of time of second line
barplot(PDF_D, names.arg=2:25, xlab="Dabs", ylab="Probability", las=1, main="Dab of both diagonals, standard bingo")
abline(h=(1:20)*0.02)
abline(h=(1:5)*0.10,lwd=2)









######### Analysis 4: Calls instead of dabs
### What if you couldn't dab all the time? That's what adulthood is like.
### In a real game of bingo, there are 75 numbers called.
### Instead of looking at number of dabs until a line is found.
### Let's investigate how many numbers need to be called instead.

### For this, we run the simulation again with three changes.
### 1) We generate a random ordering from 1 to 75 instead of 1 to 25.
### 2) Only the numbers 1 to 25 are used as bingo dabs. The rest provide no dabs.
### 3) Because the probabilities for each call number are smaller, the results are more sensitive to randomness, so the number of runs is increased from 10,000 to 50,000. (This may take a minute or two to run)


Nruns = 50000
lines_so_far_call = matrix(NA,nrow=Nruns,ncol=75)

for(run in 1:Nruns)
{
    bingo_vec = rep(FALSE,25)
    call_order = sample(1:75)

    for(k in 1:75)
    {
        if(call_order[k] <= 25)
        {
            bingo_vec[call_order[k]] = TRUE
        }
        lines_so_far_call[run,k] = sum(1*(checklines(bingo_vec = bingo_vec)))
    }
}

head(lines_so_far_call,n=2)


lines_01_call = pmin(lines_so_far_call,1)
CDF_C = apply(lines_01_call,2,mean)
PDF_C = diff(CDF_C)  ## The pdf is the derivative of cdf

## Mean average call for first line
sum(1 - CDF_C)  ## 42.49 numbers called until first line

## Median average call
min(which(CDF_C >= 0.5)) ## 44 for 50%

## Standard deviation
EX = sum(2:75 * PDF_C)
EX2 = sum( (2:75)^2 * PDF_C)

sqrt(EX2 - EX^2) ## SD: 9.44 numbers called


### Histogram of time of first line
barplot(PDF_C, names.arg=2:75, xlab="Bingo Calls", ylab="Probability", las=1, main="Bingo call of first line")
abline(h=(1:10)*0.005)








Follow up questions:

1) In Bingo, The first column 5 numbers from 1-15, the second column from 16-30 and so on. Does this change anything in the simulation or analysis?

2) In Bingo, sometimes the middle square [3,3] is given for free before the game starts. What does this change coding-wise? Results-wise?

3) In Analysis 4, checklines is called even when nothing has changed on the bingo card from last call. How would you improve the efficiency without changing the result?

4) Standard Bingo is played on 5x5 cards. What changes would be needed to the code to make this work for an arbitrary bingo card size?

No comments:

Post a Comment