Statistics et al.: Sampling, conditional probability, and random number generation

Part of the motivation behind making the course Statistics and Gambling is to infuse new applicability into introductory or intermediate probability courses. This blog post is a look at how the course is going to cover familiar probability topics with examples in games of chance, and a simulation-based (rather than theory-based) approach.

This post covers basic methods of random number generation (RNG) in R, and applying RNG to demonstrate core concepts in sampling, conditional probability, and conditional distributions. It is meant to be a very surface-level primer on the topics, just enough to give context for the deeper dives into specific games of chance.

In this post, we cover

1) Preliminary code

2) RNG (Random number generation) in sampling

3) Sampling with and without replacement

4) Conditional probability using sampling

5) RNG from a distribution

6) Conditional probability using a distribution

7) Conditional distributions

8) RNG and random seeds

Preliminary Code

Some of the code in this post relies on the following data frame and function.

card_rank = rep(c(2:10,"Jack","Queen","King","Ace"), times=4)

card_suit = rep(c("Clubs","Hearts","Diamonds","Spades"), each=13)

card_name = paste(card_rank, card_suit, sep=" of ")

df_cards_52 = data.frame(card_rank, card_suit, card_name)

names(df_cards_52) = c("rank","suit","name")

n.of.each = function(x)

{

out = rev(sort(as.numeric(table(x))))

return(out)

}

RNG using sampling

The set.seed() command ensures that the results you get are the same as those written here. Details about set.seed() are given in the optional section on random seeds at the end of this post.

This code samples all 5 numbers from 1 to 5 in a random order using R’s internal random number generator (RNG).

set.seed(8675309)

sample(x=1:5, size=5)

# Output: [1] 4 3 2 1 5

However, you can't sample 6 numbers from a population of 5 values without replacement. Replacement is explained after this section.

sample(x=1:5, size=6)

# Output: Error, cannot take a sample larger than the population when 'replace = FALSE'.

We can do it if we sample with replacement, where each number is put back into the pool after we sampled it.

sample(x=1:5, size=6, replace=TRUE)

# Output: [1] 5 3 1 5 3 1

By default, every value in x is equally likely to be selected, but we can change that with the prob setting. The following code samples from the number 1 80% of the time, and 2, 3, 4, or 5 5% of the time each.

sample(x=1:5, size=20, prob=c(.800, .05, .05, .05, .05), replace=TRUE)

# Output: [1] 1 1 3 1 1 3 3 1 1 1 1 1 1 3 2 3 1 1 1 1

The probabilities don't need to add up to 1 if you prefer to use weights instead, like this, code, were 1 is selected 4 times as often as 2, 3, 4, or 5.

sample(x=1:5, size=20, prob=c(4,1,1,1,1), replace=TRUE)

# Output: [1] 1 1 1 4 2 2 3 5 1 1 1 1 1 5 5 1 5 1 4 1

We can also use it to sample from things that aren't numbers in a range. The following code sets up a deck of 52 cards and draws 5 cards from the deck.

hand = sample(x= df_cards_52$name, size=5)

hand

# Output: "2 of Diamonds" "7 of Clubs" "7 of Spades" "Queen of Clubs" "10 of Clubs"

Sampling with and without replacement

There are two fundamental ways to sample. Both ways start out the same: select an object from a population randomly (e.g., select a person from the residents of a city, or a card from a deck).

If you are sampling with replacement, you include place that object back into the sample, so that they same object can be sampled again later (e.g., you put the card back into the deck after selecting it).

If you ware sampling without replacement, you remove the object from the sample so that every object can be sampled either once or not at all (e.g., You keep the card in your hand after selecting it. After you call someone for a poll you make a note not to call that phone number again).

This block of code samples from the numbers 1 to 5 without replacement. If you don't specify, the sample() function will always sample without replacement.

set.seed(12345)

sample(1:5, size=5, replace=FALSE)

# Output:

# [1] 3 4 2 5 1

# [1] 2 1 3 5 4

# [1] 2 5 3 4 1

Notice how after each line, the numbers 1 to 5 each appear exactly once. If we were to sample with replacement, as we do in the following code block, sometimes numbers appear more than once in a line.

set.seed(12345)

sample(1:5, size=5, replace=TRUE)

# Output:

# [1] 3 2 4 2 5

# [1] 3 2 3 2 1

# [1] 1 4 4 2 4

Most games with cards or tiles use sampling without replacement because multiple cards are drawn before shuffling the deck or tiles. Most games with dice use sampling with replacement because the result of any one die roll does not affect the result of other dice rolls.

Examples of games that use sampling without replacement include:

1) Bingo, which samples from a set of 75 balls.

2) Texas Hold'Em Poker, which samples from a deck of 52 cards.

3) Pai Gow Poker, which samples from a deck of 53 cards (52 + 1 joker).

4) Mahjong, which samples from a set of 144 tiles or cards.

Examples of games that use sampling with replacement include:

5) Craps, which involves rolls of five six-sided dice repeatedly.

6) Yahtzee, which involves rolls of two six-sided dice repeatedly.

7) Borel, which involves rolls of multiple variable-sided dice repeatedly.

8) Hubbub, which involves flipping five two-sided dice repeatedly.

There are some activities can be treated as somewhere between sampling with and without replacement, these include:

9) Blackjack, which typically draws cards from a 'shoe' of several decks shuffled together, effectively making one large superdeck of hundreds of cards. A superdeck of ten decks, or 520 cards, is large enough that strategies that rely on sampling without replacement, like counting cards, are rendered much less effective.

10) Buying randomized packs of collectible cards, collectible stickers, or trading cards. By buying cards in a pack or box, you are sampling without replacement from a population of thousands or millions of printed cards or stickers - a population large enough that the difference between sampling with or without replacement doesn't matter. Also, packs of collectibles have a more complicated probability distribution.

Conditional probability using sampling

Imagine this situation: You draw one card from a standard deck of 52 cards - it's an ace. What is the chance that the second card from your deck will also be an ace?

One way to solve this is by counting.

- There are 4 aces in a deck of 52 cards.

- After drawing the first ace, there are 3 aces left in a deck of 51 cards.

Pr(2nd ace | 1st ace) = 3/51 = 1/17 = 0.05824

The vertical bar means 'if', or 'given'.

Counting methods work well for simple cases where counting principles can be applied easily. For moderately complicated card situations, we can also use combinatorics, which are covered elsewhere. We can also use Monte Carlo, or simulation-based methods to estimate things. Counting and combinatorics can provide exact answers to simple and moderately complex probability problems, Monte Carlo methods can provide approximate answers to very complex probability problems (for example, if some cards were more likely to appear than others, for some reason).

The following code block estimates the conditional probability of drawing a second ace from a deck after drawing the first ace. It's overkill, but it shows the concept on a problem that we already know the answer to.

First, we simulate drawing a two-card hand many times.

set.seed(12345)

Nsims = 50000

all_hands = matrix("", nrow=Nsims, ncol=2)

for(k in 1:Nsims)

{

all_hands[k,] = sample(df_cards_52$rank, size=2)

}

Next, define event A as "first card is an ace", and event B as "second card is an ace", and estimate the probability of event B given event A.

count_A = length(which(all_hands[,1] == "Ace"))

count_AandB = length(which(all_hands[,1] == "Ace" & all_hands[,2] == "Ace"))

pr_BifA = count_AandB / count_A

Nsims #Output: 50000

count_A #Output: 3719

count_AandB #Output: 216

pr_BifA #Output: 0.05808

To reiterate, we drew 50000 two-card hands, and of those 3719 had an ace as the first card, and of those 3719 first-ace hands, 216 of them had two aces. From those numbers we can estimate the conditional probability as 216/3719 = 0.05808, which is close to, but not exactly the true value of 0.05824.

We could use these event counts to estimate the probability of the first card being an ace as 3719 / 50000, or 0.07438, as compared to the true value of 4/52 = 1/13 = 0.7692.

We could also estimate the chance of both cards of being aces as 216 / 50000 = 0.00432, as compared to the true value of 1/221 = 0.00452.

By increasing the number of simulations Nsims, we could increase the accuracy, but it would take longer to calculate.

Conditional probability is calculated by pr(B|A) = pr(A and B) / pr(A). We get the conditional probability by estimating the base probabilities using counts as well, which is what the following code block does.

pr_A = length(which(all_hands[,1] == "Ace")) / Nsims

pr_AandB = length(which(all_hands[,1] == "Ace" & all_hands[,2] == "Ace")) / Nsims

pr_BifA = pr_AandB / pr_A

pr_A # 0.07438

pr_AandB # 0.00432

pr_BifA # 0.05808

RNG from a distribution

RNG can do a lot more than sample from a pre-made set of values. We can also use it to pick a value (or many values) from a probability distribution. Commonly used distributions include the uniform, Poisson, geometric, and binomial, but there are many, many more distributions that can be used.

par(mfrow = c(2,2))

barplot(dunif(1), xlim=c(0, 1.5), main="Uniform, min=0, max=1", ylab="Density")

barplot(dbinom(x=0:10, size=10, prob=0.5), main="Binomial, n=10, p=0.5", ylab="Density")

barplot(dpois(x=0:10, lambda=5), main="Poisson, lambda=5", ylab="Density")

barplot(dgeom(x=0:10, prob=0.25), main="Geometric, n=10, prob=0.25", ylab="Density")

par(mfrow = c(1,1))

Uniform

The function 'runif' is short for 'Random from UNIForm distribution'. This is the most basic of random number generation, and we won't be using it directly very much.

This generates a random number between 0 and 1.

set.seed(12345)

runif(n=1)

# Output: [1] 0.7209039

This generates 10 random numbers between 0 and 1.

runif(n=10)

# Output (rounded): [1] 0.876 0.761 0.886 0.456 0.166 0.325 0.509 0.728 0.990 0.035

This generates a random number between 3 and 5.

runif(n=1, min=3, max=5)

# Output: [1] 3.304747

You can use runif() to produce random events of a known probability. For example, this is a simple coin flip program.

coin = "heads"

if(runif(1) < .5){coin = "tails"}

coin

# Output: [1] "heads"

Binomial

The function 'rbinom' is short for 'Random from BINOMial distribution'. This is used to count the number of 'successes' in a fixed number of trials. (e.g., the number of goals from a fixed number of shots, or the number of wins in a season of fixed length).

We can make an even shorter coin flip program with rbinom

rbinom(n=1, size=1, prob=0.5)

# Output: [1] 0

Or flip 10 coins and get the results from each one.

rbinom(n=10, size=1, prob=0.5)

# Output: [1] 0 0 0 0 0 1 0 0 1 1

Or flip 10 coins all at once and count the number of heads.

rbinom(n=1, size=10, prob=0.5)

# Output: [1] 6

In rbinom, n determines the number of values to generate, size determines the number of trails, and prob determines the probability of success. A success can be defined as anything binary, or a yes/no thing.

This RNG code treats rolling a 6 on a 6-sided die as a success (prob = 1/6), two dice are rolled (size = 2), and this is repeated 10 times (n = 10).

rbinom(n=10, size=2, prob=1/6)

# Output: [1] 0 1 0 0 0 1 0 0 0 0

Similarly, this code generates the number of wins in 10 seasons of 162 games of an above-average Major League Baseball team, with a 55% win chance.

rbinom(n=10, size=162, prob=0.55)

# Output: [1] 91 102 95 90 84 82 89 81 93 86

This does the same thing for a below-average team.

rbinom(n=10, size=162, prob=0.45)

# Output: [1] 70 69 67 67 69 86 80 80 66 62

Poisson

The Poisson distribution is used to count events in a fixed time. (e.g, goals in a soccer game). A typical English Premier League soccer game has 1.2 goals per team, so we can generate the goals an average team gets in 10 games with the following code.

rpois(n=10, lambda=1.2)

# Output: [1] 0 0 1 2 1 2 0 3 2 0

The average NHL team scores about 2.85 goals in regulation.

rpois(n=10, lambda=2.85)

# Output: [1] 3 4 3 4 4 1 2 2 3 7

Geometric

The geometric distribution counts the number of successes until the first failure (or the number of failures until the first success).

The lengths of 10 winning streaks of a team with a 70% winning chance (and a 30% chance of breaking the streak) in any given game can be generated with this code.

If you want to count the number of wins in a 'streak', use this:

rgeom(n=10, prob=0.3)

# Output: [1] 0 0 5 4 0 2 0 1 3 3

If you want include the number of games, which includes the loss that ends the streak, add a +1 to the end of the code like this:

rgeom(n=10, prob=0.3) + 1

# Output: [1] 1 3 3 1 2 1 1 3 5 2

The chance of getting two aces from a two-card hand (i.e., pocket aces) is 1/221. The number of hands you would have to wait before getting the pocket aces, not including the hand with the pocket aces, can be simulated with

rgeom(n=10, prob=1/221)

# Output: [1] 229 254 57 410 27 103 133 184 262 374

Conditional probability using distributions

We can use RNG from distributions to find conditional probability too.

Consider a hypothetic game between two soccer teams the Tigers and Lions. We'll define two events.

Event A: The Tigers score 2 or more goals.

Event B: The Tigers win (i.e, they score more than the Lions).

We can find the probability these events (Pr(A) and Pr(B)), and use that to find those to find some conditional probabilities, (Pr(B given A) and Pr(A given B))

We start by generating goals scored for a large number of games between each team. Assume for now each team is average, and that they score an average of 1.2 goals per game. Other possible scenarios are considered in the exercises.

set.seed(12345)

Nsims = 100000 # The number of games simulated

g_tigers = rpois(Nsims, lambda=1.2)

g_lions = rpois(Nsims, lambda=1.2)

Next, we estimate probabilities based on the proportion of simulations that fit the events.

pr_win = length(which(g_tigers > g_lions)) / Nsims

pr_draw = length(which(g_tigers == g_lions)) / Nsims

pr_loss = length(which(g_tigers < g_lions)) / Nsims

pr_score2 = length(which(g_tigers >= 2)) / Nsims

pr_A = pr_score2

pr_B = pr_win

Next, we estimate the probability of the intersection, Pr(A and B), which is the case where both events happen. In this case, that's where the Tigers score 2 or more goals AND win. Since these two events aren't independent (scoring more makes a team more likely to win), we can't just multiply the two probabilities.

pr_AB = length(which(g_tigers >= 2 & g_tigers > g_lions)) / Nsims

From that we can get conditional probabilities, like Pr(Tigers win IF they score 2 or more), and Pr(Tiger score 2 or more IF they win)

pr_BifA = pr_AB / pr_A

pr_AifB = pr_AB / pr_B

pr_A # Output: 0.33655

pr_B # Output: 0.36205

pr_AB # Output: 0.25257

pr_AifB # Output: 0.6976108

pr_BifA # Output: 0.7504680

To recap, the Tigers are estimated to have a 33.7% chance of winning, and the Lions have an estimated 36.2% chance of winning. In theory, these two numbers should be the same because the distributions that generated them are the same, so the simulation approach is showing its weaknesses. The two probabilities don’t add to 100% because there is a large chance of a draw/tie where neither team wins.

The chance of the Tigers winning AND scoring two or more is 25.2%.

Also, when the Tigers score 2 or more, they win 69.8% of the time.

Likewise, when the Tigers win, 75.0% of the time they do so by scoring 2 or more.

Conditional distributions

The number of goals in sports like hockey, soccer, and lacrosse fundamentally depend on two things: The number of shots taken, and the goalie's general ability to stop shots. There are lots of other factors involved, but for the sake of simplicity, let's assume for now that the number of shots a goalie faces follow a Poisson distribution, and that goals follow a binomial distribution. That is, every shot is independent and has the same change of going in the net.

For NHL hockey, ignoring overtime, the average team takes about 30 shots per game, and about 9.5% of those shots become goals. (For an average of 2.85 goals per game)

We could simulate the number of goals of an average team using rpois() and rbinom()

set.seed(12345)

Nsims = 100000

Nshots = rpois(n=Nsims, lambda=30)

Ngoals = rbinom(n=Nsims, size=Nshots, prob=.095)

summary(Ngoals) # Mean, Median, Quartiles

# table(Ngoals) / Nsims # Probability distribution of goals

#Output:

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 0.000 2.000 3.000 2.855 4.000 13.000

mean(Ngoals) # 2.855, close to 20 * 0.095

var(Ngoals) # 2.872, close to the mean

We could filter these simulations to estimate conditional distributions. For example: What is the average number of goals scored when the team gets 25 shots?

#summary(Ngoals[Nshots == 25])

mean(Ngoals[Nshots == 25]) # Output: 2.353

var(Ngoals[Nshots == 25]) # Output: 2.110, less than the mean.

We would expect the answer to be close to 25 * 0.095.

25 * 0.095 # Output: 2.375

What is the mean and variance of the number of goals conditional on there being 40 or more shots?

#summary(Ngoals[Nshots >= 40])

mean(Ngoals[Nshots >= 40]) # Output: 3.998

var(Ngoals[Nshots >= 40]) # Output: 3.499, less than the mean again.

We can filter in the opposite direction too. What is the distribution of the number of shots taken in 5-goal games?

summary(Nshots[Ngoals == 5])

# Output:

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 15.0 29.0 32.0 32.2 36.0 52.0

Finally, we can visualize all this with a side-by-side boxplot. Each column represents the distribution of shots taken conditional on the number of goals. The coloured box in each column is the middle half of that distribution, so in our simulation, the 50% of shutouts (zero-goal games) had between 25 and 31 shots. The number of shots in a game increases with the number of goals, but a lot less than you might expect.

boxplot(Nshots ~ Ngoals)

Number of shots (x) by number of goals (y), stats-et-al.com

(Optional) RNG and random seeds

The random number generator (RNG) in R is not truly random. Like any computer without specialized hardware or some external randomness source, R uses a pseudo random number generator. The randomness is governed by a seed.

By default, R uses the exact time in milliseconds of the computer's clock when R starts up to generate a seed. (Slot machines and card shuffling in online poker software usually use similar methods for their pseudo random number generators.)

We can change this seed in two ways. First, we can force R to use a particular random seed of our choosing. In the following code block, we set an arbitrary seed before running code that uses randomness. Notice that when we use the same seed, we get the same results. The results will be the same even if the code is run on a different computer.

set.seed(1337)

sample(x=1:5, size=5) # Output: [1] 2 3 5 1 4

sample(x=1:5, size=5) # Output: [1] 2 4 3 5 1

sample(x=1:5, size=5) # Output: [1] 5 2 1 3 4

set.seed(1337)

sample(x=1:5, size=5) # Output: [1] 2 3 5 1 4

sample(x=1:5, size=5) # Output: [1] 2 4 3 5 1

sample(x=1:5, size=5) # Output: [1] 5 2 1 3 4

set.seed(420)

sample(x=1:5, size=5) # Output: [1] 5 1 2 3 4

sample(x=1:5, size=5) # Output: [1] 1 5 4 3 2

In the following block of code, we force R to pick a seed based on the computer's clock. Notice that running the same code gets different results each time.

set.seed(NULL)

sample(x=1:5, size=5) # Output: [1] 2 4 5 1 3

sample(x=1:5, size=5) # Output: [1] 2 3 5 1 4

sample(x=1:5, size=5) # Output: [1] 2 1 5 3 4

set.seed(NULL)

sample(x=1:5, size=5) # Output: [1] 3 4 5 2 1

sample(x=1:5, size=5) # Output: [1] 2 5 1 3 4

sample(x=1:5, size=5) # Output: [1] 5 1 3 4 2

In the code blocks in the rest of this section, set.seed(12345) is placed at the beginning of each block so that anyone can recreate the same results even though randomness is used. Remove it or use set.seed(NULL) to produce your own unique results.

Exercises

Estimate the probabilities pr(A), pr(B), and pr(A and B), and pr(A|B), and pr(B|A) from the “Tigers vs. Lions” example in “conditional probability using distributions section” for the following situations.

A) Tigers are a better than the Lions. (Tigers score an average 1.4 goals/game, and Lions score an average 1.0 goals/game).

B) Tigers are worse than the Lions. (Tigers average 1.0 goals/game, Lions average 1.4 goals/game)

C) Tigers and Lions are both very attack-heavy teams (Tigers and Lions both average 2.0 goals/game)

D) Tigers and Lions are both very defense-heavy teams (Tigers and Lions both average 0.5 goals/game)

Statistics et al.

Featured post

Textbook: Writing for Statistics and Data Science

Tuesday, 19 October 2021

Sampling, conditional probability, and random number generation

No comments:

Post a Comment