Statistics et al.: R Seminar on programming with vectors

This is the first of five seminars that was given at Simon Fraser University. I've included this one in the blog because it makes for, hopefully, an excellent tutorial for intermediate users of R.

Inside:

- Vector operations
- Building a matrix
- Matrix operations
- Arrays, data frames, and data tables
- Indices
- For loops
- For loops on indices
- Apply functions
- Binding
- The permutation test – two sample t
- User-defined functions
- The replicate() function
- The by() function
- The combn() function - Permutation Test - All Combinations
- Synthesis – permutation test for ANOVA

Vector Operations in R

## Simple math operations are performed on every element in a vector
## If we start with 1 to 5
1:5

## We can double it
1:5 * 2

## Or add 3
1:5 - 3
c(3,9,2) + 5

## Vectors can also be added element by element
c(5,10,10) + c(1,2,3)
1:5 + 5:1

##Everything in R vectorizes!
X = rnorm(n=100, mean=0:99, sd=1:10)
plot(X)

paste("Label",1:10)
sample(1:10)
letters[1:10]
sample(letters[1:10])
list.files()

Matrix construction in R

### The general syntax is
### matrix( values, nrow = , ncol = )
### values is the values of the matrix
### nrow and ncol are the number of rows and columns
## It fills one column up before moving on to the next
mat.1 = matrix( 1:4, nrow = 2, ncol = 2)
mat.1

## If you don't put enough elements in the values part
## then it'll repeat whatever is there until the matrix is full
matrix(3, nrow = 2, ncol = 2)

### You can also take the transpose of a matrix with t( name of matrix )
t(mat.1)

### Or get the eigen values and eigen vectors (if the matrix is square)
eigen(mat.1)

### Or get the determinant
det(mat.1)

Matrix operations in R

## By default matrix operations are done element by element
A = matrix( 1:4, nrow = 2, ncol = 2)
B = matrix( 5:8, nrow=2, ncol=2)

### This is correct for addition,
A + B

### But it isn't how matrix multiplication is done
A * B

### To do a proper matrix multiplication, you need to use %*%
A %*% B

### We can test this with the determinant, beacuse det(AB) = det(A)det(B)

det(A)
det(B)
det(A)*det(B) ## What we should be getting
det(A * B) ## No
det(A %*% B) ## Yes

Arrays

Arrays are like matrices and vectors, but the number of dimensions are determined by the user.

# 1-dimensional array
arrA = array(1:10)
arrA = array(1:10, dim=10)

# 2-dimensional array
arrB = array(1:15, dim=c(3,5))
arrB = array(1:15, dim=c(3,5), dimnames=c("row","column"))

# 4-dimensional array
arrC = array(1:16, dim=c(2,2,2,2), dimnames=c("row","column","depth","hyper"))

Data frames are like matrices and they are two-dimensional, but they have additional meta data like object lists, and the ability to have different formats for different columns/variables.

Data tables are extensions to data frames that are optimized for big data.

# install.packages("data.table")
library(data.table)

varA = 1:5

varB = letters[1:5]

varC = as.factor(c("cat","dog","lizard","bunny","turtle"))

dataF = data.frame(varA,varB,varC)

dataT = data.frame(varA,varB,varC)

Indicies

## Specific elements of vectors, matrices, and arrays can be referenced as single elements

vec = 3:10
vec[1]
vec[7]
vec[10]

# as ranges
vec[2:6]
vec[ c(1,3,5)]

# as boolean conditions
vec[ vec>5]
vec_tf = c(TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE)
vec[vec_tf]

# or as negatives to indicate exclusion. (you can’t mix negative and positive)
vec[-1]
vec[-(1:4)]

## you can get the indicies that meet a certain criterion in a couple of ways
## -Getting the true or false

vec > 5

## Getting the 0-1 binary
1*(vec > 5)

## Getting the indicies themselves
which(vec > 5)

### Indicies work with matrices, arrays, and data frames/tables
matB = matrix(10:18,ncol=3,nrow=3)
matB[1:2,1:2]
matB[1, 1:3]
matB[4:7]
matB[4:10]
which(matB > 14)
which(matB > 14, arr.ind=TRUE)

arrE = array(101:124, dim=c(2,3,4))
arrE[1,1:2,1:2]
arrE[10:14]
which(arrE > 110)
which(arrE > 110, arr.ind=TRUE)

For loops in R

for(k in 1:5)
{
## things in here will happen 5 times
}

## Everything inside the { } will happen for each element in the list after the word "in".
## The list 1:5 has 5 numbers (which are elements), so everything happens five times.
for(k in 1:5)
{
print("That's one iteration")
}

## These elements can themselves be used in iterations.
for(count in 1:5)
{
print(count * 20)
}

## Finally, the list doesn't have to be counting up from one, either, it can be any vector. Elements can be included multiple times.
number_list = c(6,4,1,8,8,8)

for(count in number_list)
{
print(count * 10)
}

vowels = letters[c(1,5,9,15,21,25)]

for(letter in vowels)
{
print(paste0(letter,"pples and b",letter,"n",letter,"n",letter,"s!"))
}

For loops and indices in R

We can use indicies to run operations over individual elements of a vector, matrix, or array

idx_list = which(vec > 5)
part_sum = 0

for(idx in idx_list)
{
part_sum = part_sum + vec[idx]
}
part_sum

idx_list = which(arrE > 110)
part_sum = 0

for(idx in idx_list)
{
part_sum = part_sum + arrE[idx]
}

part_sum
idx_list2 = which(arrE > 110, arr.ind=TRUE)
part_sum = 0
for(k in 1:nrow(idx_list2))

{
idx1 = idx_list2[k,1]
idx2 = idx_list2[k,2]
idx3 = idx_list2[k,3]
part_sum = part_sum + arrE[idx1,idx2,idx3]
}

part_sum

### But often it’s cleaner and faster to do the operation on all relevant indicies at once.
sum(1*(vec > 5)*vec)

### or...
sum(vec[vec > 5])

## This is especially useful when the operations are identical for each element.
### And THAT is the motivation behind vectorization.
### The apply function is used to run a function FUN over a multi-dimensional object X, like a matrix, or array.
## The function FUN used in apply is done on successive subsets of X based on the selected margin, named margin.

#### The use of apply() is
## apply(X, margin, FUN)

## Sum of arrA (1-D), split across the 1st dimension.
apply(arrA,1,sum)

## Sum of arrA (1-D), split across the 2nd dimension. (fails)
apply(arrA,2,sum)

## Sum of arrB (2-D) by each
arrB
apply(arrB,1,sum)
apply(arrB,2,sum)

## Sum of data frame by each (fails)
apply(dataF,1,sum)
apply(dataF,2,sum)

## Sum of arrE (3-D) by individual dimensions, and MULTIPLE dimensions
apply(arrC, 1, sum)
apply(arrC, 2, sum)
apply(arrC, 3, sum)
apply(arrC, c(1,2), sum)
apply(arrC, c(1,2,3), sum)
sum(arrC)

### One more 4-D example
set.seed(12345)
x_iwoj = array(data=rpois(n=100*5*10*6, lambda=10),
dim=c(100,5,10,6),
dimnames=c("Player","Wicket","Over","Outcome"))
x_.wo. = apply(x_iwoj, c(2,3), sum)

## Sums (and means) return NA by default if any of the elements are missing or NA.
sum(arrB)
arrB_NA = arrB
arrB_NA[2] = NA
arrB_NA

apply(arrB_NA,1,sum)
apply(arrB_NA,2,sum)
sum(arrB_NA)

### However, sum() has an argument, na.rm= , which allows you to remove all missingness before sum is evaluated.
sum(arrB_NA, na.rm=TRUE)

### Additional arguments for the applied function FUN can be put into the apply() function. These arguments are carried into the applied function.
apply(arrB_NA,1,sum,na.rm=TRUE)
apply(arrB_NA,2,sum,na.rm=TRUE)

Binding and Concatenation in R

The rbind() and the cbind() function concatenates objects by row and column, respectively.

## It works for matrices
rbind(arrB,arrB)
cbind(arrB,arrB)

## and arrays
rbind(matB,matB)
cbind(matB,matB)

## and data frames and tables
rbind(dataF,dataF)
cbind(dataF,dataF)

## and vectors too, but cbind() transposes
rbind(letters[1:10],letters[2:11])
cbind(letters[1:10],letters[2:11])

### To concatenate arrays, we can use the abind() function from the abind package. For this function, the ‘along’ argument defines the dimension that the concatenation is done along.
abind(arrB,arrB,along=1)
abind(arrC,arrC,along=1)
abind(arrC,arrC,along=3)

Randomization/Permutation Test - Single Run

### ( See also: http://thomasleeper.com/Rcourse/Tutorials/permutationtests.html )

### Define the two samples
group1 = c(17, 15, 20, 27, 22, 20, 20, 18)
group2 = c(18, 15, 9, 18, 13, 17 ,7, 11, 12, 15, 26, 24)
pool = c(group1, group2) ## This makes a list of both groups together, a pool of 20 values.
set.seed(12345)

### Next, we randomly select one of the 125,970 combinations to go in
### decide which 8 of the 20 are going to be group1

## Random sample of 8 numbers from 1 to 20
samp_idx = sample(1:20,size=8)
rndm_grp1 = pool[samp_idx]

## Every element of the pool that was NOT in the sample of 8 numbers
rndm_grp2 = pool[-samp_idx]

## Let's say our alternative hypothesis was that group1's mean is higher than group2's mean
## If this is true, then it should be at least as large as a random difference between groups most of the time.
group_diff = mean(group1) - mean(group2)
random_diff = mean(rndm_grp1) - mean(rndm_grp2)

### Is it?
group_diff >= random_diff

Randomization/Permutation Test - For Loop

### Set up the group difference and a counter before we start
random_diff = rep(NA,10000)
observed_diff = mean(group1) - mean(group2)
set.seed(12345)

### For each of 10000 iterations...
for(k in 1:10000)
{
### Make the combination and assign the values to two groups
idx = sample(1:20,size=8)
rndm_grp1 = pool[idx]
rndm_grp2 = pool[-idx]

### Find the differences in the means of the random groups
random_diff[k] = mean(rndm_grp1) - mean(rndm_grp2)
}

### The p-value is the proportion of times the random group had a bigger mean difference
mean( 1*(random_diff >= observed_diff))

Creating custom R functions

Computations that you expect to recycle many times can be made into user-defined function.

For example, a sum-of-squares error functions

sum_sq = function(x,y)
{
result = sum( (x-y)^2 )
return(result)
}

Every user-defined function needs a name and some code inside the curly braces {}

Most, but not all, functions have at least one 'return' command. When a return command is reached, no other code in the function is run, and whatever object was being returned is the output value of the function. Any object can be returned, not just single values, but also vectors, matrices, arrays, data frames, lists, and errors.

Most, but not all, functions also have input arguments (sometimes called parameters, a term which we'll avoid).

input_x = 1:5
input_y = rep(3,5)
sum_sq(input_x, input_y)
SS = sum_sq(input_x, input_y)
SS

User-defined functions can be used in the apply() command, but one of the arguments should be named 'x' in the function definition as to help apply() identify what to do with the subsetted information.

matX = array(1:36, dim=c(6,6))
vecY = rep(0,6)

apply(matX, 1, sum_sq, y=vecY)
apply(matX, 2, sum_sq, y=vecY)

Randomization/Permutation Test - Vectorized

Now to repeat the permutation test in a vectorized fashion with a user-defined function and an apply function.

First, define a function that gets the mean difference between two randomly selected groups from our pool.We need the pool and n1. n2 is implied from the size of the pool and n1

get_rnd_meandiff = function(pool, n1)
{
N = length(pool)
idx = sample(1:N,size=n1)
rndm_grp1 = pool[idx]
rndm_grp2 = pool[-idx]
result = mean(rndm_grp1) - mean(rndm_grp2)
return(result)
}

#### Since the computation, including the resampling is identical for each run,
#### we can use the replication() function to quickly run the same code many times.

# replicate(n, expr) ## or...
# replicate(n, expr, simplify=”matrix”)
replicate(5, get_rnd_meandiff(pool, 8))
replicate(5, get_rnd_meandiff(pool, 8))

random_diffs = replicate(10000, get_rnd_meandiff(pool, 8))
mean( 1*(random_diffs >= observed_diff))

random_diffs = replicate(10000, get_rnd_meandiff(pool, 8))
mean( 1*(random_diffs >= observed_diff))

random_diffs = replicate(10000, get_rnd_meandiff(pool, 8))
mean( 1*(random_diffs >= observed_diff))

summary(random_diffs)
hist(random_diffs)

Permutation Test - All Combinations

What if, instead of randomly selected groups each time, we wanted to use each possible group assignment exactly once.

The combn() function will output, as a 2D array / matrix, each possible combination as a column.

## combn(x,m)
## Select m elements from the vector of elements x.

## warning!!! BIG!
all_combos = combn(1:20,8)
dim(all_combos)
choose(20,8)

Now we want to modify our function definition to take a specific set of index values instead of choosing random ones internally.

Recall that x is the value that apply subsets/iterates over so let x be the index values

get_set_meandiff = function(x, pool) ## n1 no longer needed, implied by x
{

N = length(pool)

idx = x ## Changed this line from a sample to an assignment
rndm_grp1 = pool[idx]
rndm_grp2 = pool[-idx]
result = mean(rndm_grp1) - mean(rndm_grp2)
return(result)

}

get_set_meandiff(all_combos[,1], pool=pool)
get_set_meandiff(all_combos[,1], pool=pool)
get_set_meandiff(all_combos[,1], pool=pool)
get_set_meandiff(all_combos[,2], pool=pool)
get_set_meandiff(all_combos[,2000], pool=pool)

### Finally, we can put this new function into an apply function.
all_diffs = apply(all_combos, 2, get_set_meandiff, pool=pool)

mean( 1*(all_diffs >= observed_diff))
summary(all_diffs)
hist(all_diffs)

Permutation Test - ANOVA

# Let's extend the random permutations to a simple ANOVA

get_rnd_anova = function(y, ni)
{

N = sum(ni)
k = length(ni)
group = rep(1:k, times=ni)
group = sample(group) # randomly permute the order of the groups
SS_total = sum( (y - mean(y)) ^2)

## Compute Sum of Squares error (i.e. within groups)
group_mean_idx = as.numeric(by(y,group,mean))
group_mean = group_mean_idx[group]
SS_error = sum( (y – group_mean)^2)

## Fill the rest of the ANOVA table with what we've done
SS_group = SS_total - SS_error
r_sq = SS_total / SS_group

DF_error = N - k
DF_group = k - 1
MS_error = SS_error / (N - k)
MS_group = SS_group / (k - 1)

F_stat = MS_group / MS_error

## Make the returned result a list, which is like a data frame in that not every element needs the same format, and how names can be assigned.

result = list(F_stat = F_stat, SS_total = SS_total,
SS_error=SS_error, SS_group=SS_group,
r_sq = r_sq, DF_error=DF_error,
DF_group=DF_group, MS_error = MS_error,
MS_group=MS_group)

return(result)

}

set.seed(1234)

y = rpois(12,lambda=10)
group = c(1,1,1,2,2,2,2,3,3,3,3,3)
ni = c(3,4,5)

group_mean_idx = as.numeric(by(y,group,mean))
group_mean = group_mean_idx[group]
observed_SSE = sum( (y - group_mean)^2)
by(y,group,mean)

observed_SSE

Try the data with the code inside the function first to see it in action before replicate()ing the function.

replicate(5, get_rnd_anova(y, ni))

random_SSE = unlist(replicate(10000, get_rnd_anova(y, ni))["SS_error",])

mean(1*(random_SSE <= observed_SSE))

One final caveat: “less thinking, more computing” - Andrew Henrey

Although apply() may run faster, it’s also important to consider the time it takes to create code. Your time is a lot more valuable than the computer’s time, so it doesn’t make sense to spend effort optimizing something that won’t be run very often or for very long periods.

One case of particular interest is when you’re doing many iterations of something in which the computation of one iteration depends on the results of at least one previous iteration.

Take this simple time series, for example, which we could write quickly with a for loop

Niter = 10000
x = numeric(Niter)
x[1] = rnorm(1)

for(k in 1:(Niter-1))
{
X[k+1] =0.99*x[k] + rnorm(1)
}

…however, to do so in a vectorized fashion could easily take more total time to write and run than the less elegant for-loop version*. In the optimization workshop, we will look other examples where each for loop (or while loop) is an improved estimate over the previous iteration, for which an apply function would be much harder than a loop.

*If the 0.99 multiplier were 1 instead, we could use the cumulative sum function cumsum(), as in

x = cumsum(rnorm(Niter))

Statistics et al.

Featured post

Textbook: Writing for Statistics and Data Science

Saturday, 24 June 2017

R Seminar on programming with vectors