This is the first of five seminars that I'll be giving at Simon Fraser University next week. I've included this one in the blog because it makes for, hopefully, an excellent tutorial for intermediate users of R.
Agenda:

Vector operations

Building a matrix

Matrix operations

Arrays, data frames, and data tables

Indices

For loops

For loops on indices

Apply functions

Binding

The permutation test – two
sample t

Userdefined functions

The replicate() function

The by() function

The combn() function 
Permutation Test  All Combinations

Synthesis – permutation
test for ANOVA
######
##
Vector Operations
##
Simple math operations are
performed on every element in a vector
##
If we start with 1 to 5
1:5
##
We can double it
1:5
* 2
##
Or add 3
1:5
 3
c(3,9,2)
+ 5
##
Vectors can also be added element by element
c(5,10,10)
+ c(1,2,3)
1:5
+ 5:1
##Everything
in R vectorizes!
X
= rnorm(n=100, mean=0:99, sd=1:10)
plot(X)
paste("Label",1:10)
sample(1:10)
letters[1:10]
sample(letters[1:10])
list.files()
###########
Matrix construction
###
The general form is
###
matrix( values, nrow = , ncol = )
###
values is the values of the matrix
###
nrow and ncol are the number of rows and columns
##
It fills one column up before moving on to the next
mat.1
= matrix( 1:4, nrow = 2, ncol = 2)
mat.1
##
If you don't put enough elements in the values part
##
then it'll repeat whatever is there until the matrix is full
matrix(3,
nrow = 2, ncol = 2)
###
You can also take the transpose of a matrix with t( name of matrix )
t(mat.1)
###
Or get the eigen values and eigen vectors (if the matrix is square)
eigen(mat.1)
###
Or get the determinant
det(mat.1)
##########
Matrix operations
##
By default matrix operations are done element by element
A
= matrix( 1:4, nrow = 2, ncol = 2)
B
= matrix( 5:8, nrow=2, ncol=2)
###
This is correct for addition,
A
+ B
###
But it isn't how matrix multiplication is done
A
* B
###
To do a proper matrix multiplication, you need to use %*%
A
%*% B
###
We can test this with the determinant, beacuse det(AB) = det(A)det(B)
det(A)
det(B)
det(A)*det(B)
## What we should be getting
det(A
* B) ## No
det(A
%*% B) ## Yes
############
Arrays
###
Arrays are like matrices and vectors, but the number of dimensions
are determined by the user.
#
1dimensional array
arrA
= array(1:10)
arrA
= array(1:10, dim=10)
#
2dimensional array
arrB
= array(1:15, dim=c(3,5))
arrB
= array(1:15, dim=c(3,5), dimnames=c("row","column"))
#
4dimensional array
arrC
= array(1:16, dim=c(2,2,2,2),
dimnames=c("row","column","depth","hyper"))
###
Data frames are like matrices and they are twodimensional, but they
have additional meta data like object lists, and the ability to have
different formats for different columns/variables.
###
Data tables are extensions to data frames that are optimized for big
data.
#
install.packages(“data.table”)
library(data.table)
varA
= 1:5
varB
= letters[1:5]
varC
= as.factor(c("cat","dog","lizard","bunny","turtle"))
dataF
= data.frame(varA,varB,varC)
dataT
= data.frame(varA,varB,varC)
##############
indicies
##
Specific elements of vectors, matrices, and arrays can be referenced
#as
single elements
vec
= 3:10
vec[1]
vec[7]
vec[10]
#
as ranges
vec[2:6]
vec[
c(1,3,5)]
#
as conditions
vec[
vec>5]
vec_tf
= c(TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE)
vec[vec_tf]
#
or as negatives (you can’t mix negative and positive)
vec[1]
vec[(1:4)]
##
you can get the indicies that meet a certain criterion in a couple of
ways
##
Getting the true or false
vec
> 5
##
Getting the 01 binary
1*(vec
> 5)
##
Getting the indicies themselves
which(vec
> 5)
###
Indicies work with
matrices, arrays, and data frames/tables
matB
= matrix(10:18,ncol=3,nrow=3)
matB[1:2,1:2]
matB[1,
1:3]
matB[4:7]
matB[4:10]
which(matB
> 14)
which(matB
> 14, arr.ind=TRUE)
arrE
= array(101:124, dim=c(2,3,4))
arrE[1,1:2,1:2]
arrE[10:14]
which(arrE
> 110)
which(arrE
> 110, arr.ind=TRUE)
##########
###
For loops
for(k
in 1:5)
{
##
things in here will happen 5 times
}
##
Everything inside the { } will happen for each element
in the list after the word "in".
##
The list 1:5 has 5 numbers (which
are elements), so everything
happens five times.
for(k
in 1:5)
{
print("That's
one iteration")
}
##
These elements can
themselves be used in
iterations.
for(count
in 1:5)
{
print(count
* 20)
}
##
Finally, the list doesn't have to be counting up from one, either, it
can be any vector
number_list
= c(6,4,1,8)
for(count
in number_list)
{
print(count
* 10)
}
vowels
= letters[c(1,5,9,15,21,25)]
for(letter
in vowels)
{
print(paste0(letter,"pples
and b",letter,"n",letter,"n",letter,"s!"))
}
###
For loops and indices
###
We can use indicies to run operations over individual elements of a
vector, matrix, or array
idx_list
= which(vec > 5)
part_sum
= 0
for(idx
in idx_list)
{
part_sum
= part_sum + vec[idx]
}
part_sum
idx_list
= which(arrE > 110)
part_sum
= 0
for(idx
in idx_list)
{
part_sum
= part_sum + arrE[idx]
}
part_sum
idx_list2
= which(arrE > 110, arr.ind=TRUE)
part_sum
= 0
for(k
in 1:nrow(idx_list2))
{
idx1
= idx_list2[k,1]
idx2
= idx_list2[k,2]
idx3
= idx_list2[k,3]
part_sum
= part_sum + arrE[idx1,idx2,idx3]
}
part_sum
###
But often it’s cleaner and faster to do the operation on all
relevant indicies at once.
sum(1*(vec
> 5)*vec)
###
or...
sum(vec[vec
> 5])
##
This is especially useful when the operations are identical for each
element.
###
And THAT is the motivation behind vectorization.
###
The apply function is used to run a function FUN over a
multidimensional object X, like a matrix, or array.
##
The function FUN used in apply is done on successive subsets of X
based on the selected margin, named margin.
####
The use of apply() is
##
apply(X, margin, FUN)
##
Sum of arrA (1D), split across the 1st dimension.
apply(arrA,1,sum)
##
Sum of arrA (1D), split across the 2nd dimension. (fails)
apply(arrA,2,sum)
##
Sum of arrB (2D) by each
arrB
apply(arrB,1,sum)
apply(arrB,2,sum)
##
Sum of data frame by each (fails)
apply(dataF,1,sum)
apply(dataF,2,sum)
##
Sum of arrE (3D) by individual dimensions, and MULTIPLE dimensions
apply(arrC,
1, sum)
apply(arrC,
2, sum)
apply(arrC,
3, sum)
apply(arrC,
c(1,2), sum)
apply(arrC,
c(1,2,3), sum)
sum(arrC)
###
One more 4D example
set.seed(12345)
x_iwoj
= array(data=rpois(n=100*5*10*6, lambda=10),
dim=c(100,5,10,6),
dimnames=c("Player","Wicket","Over","Outcome"))
x_.wo.
= apply(x_iwoj, c(2,3), sum)
##
Sums (and means) return NA by default if any of the elements are
missing or NA.
sum(arrB)
arrB_NA
= arrB
arrB_NA[2]
= NA
arrB_NA
apply(arrB_NA,1,sum)
apply(arrB_NA,2,sum)
sum(arrB_NA)
###
However, sum() has an argument, na.rm= , which allows you to remove
all missingness before sum is evaluated.
sum(arrB_NA,
na.rm=TRUE)
###
Additional arguments for the applied function FUN can be put into the
apply() function. These arguments are carried into the applied
function.
apply(arrB_NA,1,sum,na.rm=TRUE)
apply(arrB_NA,2,sum,na.rm=TRUE)
###
Binding / Concatenation
###
The rbind() and the cbind() function concatenates objects by row and
column, respectively.
##
It works for matrices
rbind(arrB,arrB)
cbind(arrB,arrB)
##
and arrays
rbind(matB,matB)
cbind(matB,matB)
##
and data frames and tables
rbind(dataF,dataF)
cbind(dataF,dataF)
##
and vectors too, but cbind() transposes
rbind(letters[1:10],letters[2:11])
cbind(letters[1:10],letters[2:11])
###
To concatenate arrays, we can use the abind() function from the abind
package. For this function, the ‘along’ argument defines the
dimension that the concatenation is done along.
abind(arrB,arrB,along=1)
abind(arrC,arrC,along=1)
abind(arrC,arrC,along=3)
###############
###
Randomization/Permutation Test  Single Run
###
( See also:
http://thomasleeper.com/Rcourse/Tutorials/permutationtests.html
)
###
Define the two samples
group1
= c(17, 15, 20, 27, 22, 20, 20, 18)
group2
= c(18, 15, 9, 18, 13, 17 ,7, 11, 12, 15, 26, 24)
pool
= c(group1, group2) ## This makes a list of both groups together, a
pool of 20 values.
set.seed(12345)
###
Next, we randomly select one of the 125,970 combinations to go in
###
decide which 8 of the 20 are going to be group1
##
Random sample of 8 numbers from 1 to 20
samp_idx
= sample(1:20,size=8)
rndm_grp1
= pool[samp_idx]
##
Every element of the pool that was NOT in the sample of 8 numbers
rndm_grp2
= pool[samp_idx]
##
Let's say our alternative hypothesis was that group1's mean is higher
than group2's mean
##
If this is true, then it should be at least as large as a random
difference between groups most of the time.
group_diff
= mean(group1)  mean(group2)
random_diff
= mean(rndm_grp1)  mean(rndm_grp2)
###
Is it?
group_diff
>= random_diff
################
###
Randomization/Permutation Test Test  For Loop
###
Set up the group difference and a counter before we start
random_diff
= rep(NA,10000)
observed_diff
= mean(group1)  mean(group2)
set.seed(12345)
###
For each of 10000 iterations...
for(k
in 1:10000)
{
###
Make the combination and assign the values to two groups
idx
= sample(1:20,size=8)
rndm_grp1
= pool[idx]
rndm_grp2
= pool[idx]
###
Find the differences in the means of the random groups
random_diff[k]
= mean(rndm_grp1)  mean(rndm_grp2)
}
###
The pvalue is the proportion of times the random group had a bigger
mean difference
mean(
1*(random_diff >= observed_diff))
###################
####
Creating custom functions
####
Computations that you expect to recycle many times can be made into
userdefined function.
###
For example, a sumofsquares error functions
sum_sq
= function(x,y)
{
result
= sum( (xy)^2 )
return(result)
}
###
Every userdefined function needs a name and some code inside the
curly braces {}
###
Most, but not all, functions have at least one 'return' command. When
a return command is reached, no other code in the function is run,
and whatever object was being returned is the output value of the
function. Any object can be returned, not just single values, but
also vectors, matrices, arrays, data frames, lists, and errors.
###
Most, but not all, functions also have input arguments (sometimes
called parameters, a term which we'll avoid).
input_x
= 1:5
input_y
= rep(3,5)
sum_sq(input_x,
input_y)
SS
= sum_sq(input_x, input_y)
SS
###
Userdefined functions can be used in the apply() command, but one of
the arguments should be named 'x' in the function definition as to
help apply() identify what to do with the subsetted information.
matX
= array(1:36, dim=c(6,6))
vecY
= rep(0,6)
apply(matX,
1, sum_sq, y=vecY)
apply(matX,
2, sum_sq, y=vecY)
####################
#####
Randomization/Permutation Test Test  Vectorized
#####
Now to repeat the permutation test in a vectorized fashion with a
userdefined function and an apply function
####
First, define a function that gets the mean difference between two
randomly selected groups from our pool.
#
We need the pool and n1. n2 is implied from the size of the pool and
n1
get_rnd_meandiff
= function(pool, n1)
{
N
= length(pool)
idx
= sample(1:N,size=n1)
rndm_grp1
= pool[idx]
rndm_grp2
= pool[idx]
result
= mean(rndm_grp1)  mean(rndm_grp2)
return(result)
}
####
Since the computation, including the resampling is identical for each
run,
####
we can use the replication() function to quickly run the same code
many times.
#
replicate(n, expr) ## or...
#
replicate(n, expr, simplify=”matrix”)
replicate(5,
get_rnd_meandiff(pool, 8))
replicate(5,
get_rnd_meandiff(pool, 8))
random_diffs
= replicate(10000, get_rnd_meandiff(pool, 8))
mean(
1*(random_diffs >= observed_diff))
random_diffs
= replicate(10000, get_rnd_meandiff(pool, 8))
mean(
1*(random_diffs >= observed_diff))
random_diffs
= replicate(10000, get_rnd_meandiff(pool, 8))
mean(
1*(random_diffs >= observed_diff))
summary(random_diffs)
hist(random_diffs)
################
###
Permutation Test  All Combinations
###
What if, instead of randomly selected groups each time, we wanted to
use each possible group assignment exactly once.
###
The combn() function will output, as a 2D array / matrix, each
possible combination as a column.
##
combn(x,m) ## Select m elements from the vector of elements x.
##
warning!!! BIG!
all_combos
= combn(1:20,8)
dim(all_combos)
choose(20,8)
####
Now we want to modify our function definition to take a specific set
of index values instead of choosing random ones internally.
###
recall that x is the value that apply subsets/iterates over
###
so let x be the index values
get_set_meandiff
= function(x, pool) ## n1 no longer needed, implied by x
{
N
= length(pool)
idx
= x ## Changed this line from a sample to an assignment
rndm_grp1
= pool[idx]
rndm_grp2
= pool[idx]
result
= mean(rndm_grp1)  mean(rndm_grp2)
return(result)
}
get_set_meandiff(all_combos[,1],
pool=pool)
get_set_meandiff(all_combos[,1],
pool=pool)
get_set_meandiff(all_combos[,1],
pool=pool)
get_set_meandiff(all_combos[,2],
pool=pool)
get_set_meandiff(all_combos[,2000],
pool=pool)
###
Finally, we can put this new function into an apply function.
all_diffs
= apply(all_combos, 2, get_set_meandiff, pool=pool)
mean(
1*(all_diffs >= observed_diff))
summary(all_diffs)
hist(all_diffs)
#########
Permutation Test  ANOVA (As time permits)
#
Let's extend the random
permutations to a simple ANOVA
get_rnd_anova
= function(y, ni)
{
N
= sum(ni)
k
= length(ni)
group
= rep(1:k, times=ni)
group
= sample(group) # randomly permute the order of the groups
SS_total
= sum( (y  mean(y)) ^2)
##
Compute Sum of Squares error (i.e. within groups)
group_mean_idx
= as.numeric(by(y,group,mean))
group_mean
= group_mean_idx[group]
SS_error
= sum( (y – group_mean)^2)
##
Fill the rest of the ANOVA table with what we've done
SS_group
= SS_total  SS_error
r_sq
= SS_total / SS_group
DF_error
= N  k
DF_group
= k  1
MS_error
= SS_error / (N  k)
MS_group
= SS_group / (k  1)
F_stat
= MS_group / MS_error
##
Make the returned result a list, which is like a data frame in that
not every element needs the same format, and how names can be
assigned.
result
= list(F_stat = F_stat, SS_total = SS_total, SS_error=SS_error,
SS_group=SS_group, r_sq = r_sq, DF_error=DF_error, DF_group=DF_group,
MS_error = MS_error, MS_group=MS_group)
return(result)
}
set.seed(1234)
y
= rpois(12,lambda=10)
group
= c(1,1,1,2,2,2,2,3,3,3,3,3)
ni
= c(3,4,5)
group_mean_idx
= as.numeric(by(y,group,mean))
group_mean
= group_mean_idx[group]
observed_SSE
= sum( (y  group_mean)^2)
by(y,group,mean)
observed_SSE
###
Try the data with the code inside the function first to see it in
action before replicate()ing the function.
replicate(5,
get_rnd_anova(y, ni))
random_SSE
= unlist(replicate(10000, get_rnd_anova(y, ni))["SS_error",])
mean(1*(random_SSE
<= observed_SSE))
One
final caveat: “less thinking, more computing”  Andrew Henrey
Although
apply() may run faster, it’s also important to consider the time it
takes to create code. Your time is a lot more valuable than the
computer’s time, so it doesn’t make sense to spend effort
optimizing something that won’t be run very often or for very long
periods.
One
case of particular interest is when you’re doing many iterations of
something in which the computation of one iteration depends on the
results of at least one previous iteration.
Take
this simple time series, for example.
Which
we could write quickly with a for loop
Niter
= 10000
x
= numeric(Niter)
x[1]
= rnorm(1)
for(k
in 1:(Niter1))
{
X[k+1]
=0.99*x[k] + rnorm(1)
}
…however,
to do so in a vectorized fashion could easily take more total time to
write and run than the less elegant forloop version*. In the
optimization workshop, we will look other examples where each for
loop (or while loop) is an improved estimate over the previous
iteration, for which an apply function would be much harder than a
loop.
*If
the 0.99 multiplier were 1 instead, we could use the cumulative sum
function cumsum(), as in
x
= cumsum(rnorm(Niter))