Tuesday, 28 May 2019

Package Spotlight: anim.plots

The package anim.plots behaves like a user-friendly shell on top of animate that produces animations of some of the most common types of plots in base R in a more intuitive fashion than animate does.

This package depends on two other important packages:

-   magick, which is an R interface to ImageMagick, software used to create animated GIFs from still images.

-   animation, which is an R library that can be used to create animations from any collection of plots.
The package gganimate is also worth mentioning; it is a ggplot2-based equivalent of anim.plots. For higher-quality, but higher-overhead, images and animations, I recommend the ggplot2/gganimate path.
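As a rough illustration of that alternative path (a sketch of my own, not from the gganimate documentation; the data frame and column names here are hypothetical), a gganimate version of an animated sine curve might look like this:

```r
library(ggplot2)
library(gganimate)

## Hypothetical data: one sine curve per frame
df = data.frame(
  x = rep(1:100/10, 10),
  frame = rep(1:10, each = 100)
)
df$y = sin(df$x * df$frame / 4)

## transition_states() steps through the levels of 'frame'
p = ggplot(df, aes(x, y)) +
  geom_line(colour = "orange") +
  transition_states(frame)

animate(p, fps = 2)
```

The trade-off is as described above: more setup than anim.plot(), but full access to ggplot2's theming and geoms.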

Basic plots: Dot/Scatter Plots


### First, some data about the animations:
## resol will be the number of data points plotted for each frame of the animation.
## Nframes is the number of frames
## times is an index that dictates which data point is plotted in which frame.
resol = 100 ## Resolution
Nframes = 10 ## Number of frames
times = rep(1:Nframes, each=resol)

### Next, let’s set up some x and y coords to plot
x = rep(1:resol/Nframes, Nframes)
y = sin(x*times/4)

## So the first frame will be everything where times is 1.
x[times == 1]
y[times == 1]

## ..up to the last frame
x[times == Nframes]
y[times == Nframes]

### The actual animation code will split the x and y by the 'times' argument and draw one frame for each group.
anim.plot(x,y,times, type="l", col="orange", lwd=2)

Animation speed

The default speed is 1 frame per second. That can be changed with the speed argument.
# 2 FPS
anim.plot(x,y,times, type="l", col="orange", lwd=2, speed=2)
# 4 FPS
anim.plot(x,y,times, type="l", col="orange", lwd=2, speed=4)

### You can also make times a single number, which will dictate the number of frames. The x and y coordinates used will be split evenly.
anim.plot(1:10, 1:10, times=5, cex=3)

The window argument

Alternatively, the ‘window’ argument will dictate which of the x and y are used for each frame. The number of frames is the length of x (and y).
It uses an internal variable called t.
If you want only the t'th coordinate pair drawn on the t'th frame, use window=t.
anim.plot(1:10, 1:10, window=t, cex=3)

## If you want all the coords up to and including the t’th one on the t’th frame, use 1:t.
anim.plot(1:10, 1:10, window=1:t, cex=3)

## If you want more of a 'moving window' effect, use (t-2):t
anim.plot(1:10, 1:10, window=(t-2):t, cex=3)

## Setting a fixed value or range (without using t) just makes a static animation with all the frames having that range.
anim.plot(1:10, 1:10, window=3:7, cex=3)

Replaying and modifying animations

The replay() function can be used to play an animation that has already been defined.
Modifications to the animation can be passed as additional arguments inside replay().
myAnim = anim.plot(x,y,times, type="l", col="orange", lwd=2, speed=4)

replay(myAnim) ## Replay once

for(k in 1:5){ ## Loop 5 times
  replay(myAnim, speed=10)
}

Beyond that, the before and after arguments allow you to evaluate expressions before and after each frame is rendered, respectively.
after is particularly useful for drawing additional details like legends.

replay(myAnim, after=legend("topleft", legend="My legend"))

replay(myAnim, after=abline(h=0))

If you are looking to run more than one line after (or before) each frame, the best approach is to wrap everything you want to do in a function of your own and pass a call to that function as the "after" argument instead.

guidelines = function(distance, shift, location=NA){
    location = shift + (-50:50)*distance
    abline(h = location)
}

fourlegs = function(){
    legend("topleft", legend="Leg 1")
    legend("topright", legend="Leg 2")
    legend("bottomleft", legend="Leg 3")
    legend("bottomright", legend="Leg 4")
}

doboth = function(distance, shift, location=NA){
    guidelines(distance, shift, location)
    fourlegs()
}

replay(myAnim, after=guidelines(1,0))
replay(myAnim, after=fourlegs())
replay(myAnim, after=doboth(1,0)) 




Combining / merging animations

For more complicated additions, the merge() function is also useful.
merge(a,b) will render the frames of b on top of the frames of a, provided that both have the same number of frames and the same speed.
Note that in the following example, part2 uses anim.points(), which animates the points() command for adding points to an existing plot.

part1 = anim.plot(1:5, 1:5, speed=2, pch=17, cex=3)
part2 = anim.points(1:5, 5:1,col="red", speed=2, pch=8, cex=3)
both = merge(part1, part2)

# Merges of three or more animations are possible, but you have to use the individual parts.
part3 = anim.points(1:5, rep(3,5) ,col="blue", speed=2, pch=12, cex=2)
all3 = merge(part1,part2,part3)

Saving your animations

The function anim.save() is a wrapper around the animation package’s saveGIF() function. saveGIF() can already save files as .gif images and .mp4 movies; anim.save() can also save animations to .swf Shockwave Flash, .html, and .latex files (I presume in base64 encoding).

The animations that are saved loop infinitely.

First, check your working directory. This is where all your .gif files are going to be saved. By default, it WILL overwrite any existing file with the same name.


### Save an animation
tmp <- anim.plot(1:10, 1:10, pch=1:10, cex=3, show=FALSE)
anim.save(tmp, "mygif.gif")

## You can save outputs of replays as well.
anim.save(replay(all3), "all3.gif")
anim.save(replay(myAnim, after=doboth(1,0)), "doboth.gif")

Other possible animations

Try these examples, or copy/paste them from the help() files.

help(anim.barplot) ## Get the details directly
example(anim.barplot, give.lines=TRUE) # Get the example code without running it.

example(anim.barplot) # Run the example code

For full details, see:

Tuesday, 16 April 2019

President's Trophy - A Curse by Design

There are lots of ways the NHL rewards failure and punishes excellence, like the player draft, the ever-shrinking salary cap, and the half-win awarded for participating in overtime, but even the way in which playoff pairings are decided has a perverse incentive.

There are three rewards to doing well in the NHL regular season:
1. Going to the playoffs,
2. A favorable first-round pairing in said playoffs,
3. Home advantage for playoff games more often.

Here I argue that the first-round pairings are not as favorable as they could be. 

Tuesday, 9 April 2019

Natural Language Processing in R: Edit Distance

These are the notes for the second lecture in the unit on text processing. Some useful ideas, like exact string matching and the definitions of characters and strings, are covered in the notes for Natural Language Processing in R: Strings and Regular Expressions.

Edit distance, also called Levenshtein distance, is a measure of the number of primary edits that would need to be made to transform one string into another. The R function adist() is used to find the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14
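To make the edit operations concrete, consider the classic kitten/sitting example, which needs exactly three single-character edits. Base R's adist() can also report the breakdown of operations via its counts argument:

```r
## "kitten" -> "sitting": substitute k->s, substitute e->i, insert g
adist("kitten", "sitting")  # edit distance 3

## counts=TRUE attaches the number of insertions, deletions,
## and substitutions as an attribute of the result
d = adist("kitten", "sitting", counts = TRUE)
attr(d, "counts")  # 1 insertion, 0 deletions, 2 substitutions
```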

Natural Language Processing in R: Strings and Regular Expressions.

In this post, I go through a lesson in natural language processing (NLP), in R. Specifically, it covers how strings operate in R, how regular expressions work in the stringr package by Hadley Wickham, and some exercises. Included with the exercises are a list of expected hang-ups, as well as an R function that can quickly check the solutions.

This lesson is designed for a 1.5-2 hour class for senior undergrads.

  • Strings in R
    • Strings can be stored and manipulated in a vector
    • Strings are not factors
    • Escape sequences
    • The str_match() function
  • Regular expressions in R
    • Period . means 'any'
    • Vertical line | means 'or'
    • +, *, and {} define repeats
    • ^ and $ mean 'beginning with' and 'ending with'
    • [] is a shortcut for 'or'
    • hyphens in []
    • example: building a regular expression for phone numbers 
  • Exercises
    • Detect e-mail addresses
    • Detect a/an errors
    • Detect Canadian postal codes

Sunday, 7 April 2019

Writing R documentation, simplified

A massive part of statistical software development is the documentation. Good documentation is more than just a help file: it serves as commentary on how the software works, includes use cases, and cites any relevant sources.

One cool thing about R documentation is that it uses a system that allows it to be put into a variety of different formats while only needing to be written once.
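One common way to get that write-once property (an illustration of mine, not necessarily the workflow the post describes) is the roxygen2 package, where structured comments directly above a function are converted into the .Rd source that R then renders as HTML, PDF, or plain-text help:

```r
## roxygen2 comments live directly above the function they document;
## devtools::document() turns them into a man/add.Rd help file
#' Add two numbers
#'
#' @param x A numeric value.
#' @param y A numeric value.
#' @return The sum of x and y.
#' @examples
#' add(1, 2)
#' @export
add = function(x, y) x + y
```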

Monday, 1 April 2019

Bingo analysis, a tutorial in R

I'm toying with the idea of writing a book about statistical analyses of classic games. The target audience would be mathematically interested laypeople, much like Jeffrey Rosenthal's book Struck by Lightning ( https://www.amazon.ca/Struck-Lightning-Jeffrey-S-Rosenthal/dp/0006394957 ).

The twist would be that each chapter would contain step-by-step R or Python code so that the reader could do the same analysis and make changes based on their own questions. Material would be like this post on Bingo, as well as my previous post on Snakes and Ladders ( https://www.stats-et-al.com/2017/11/snakes-and-ladders-and-transition.html ).

There would also be some work on chess variants, Othello, poker, and possibly Go, mahjong, and Pente. Tied to each analysis could be light lessons on statistics. This Bingo analysis involves Monte Carlo-style simulation, as well as notes on computing expected values, CDFs, and PDFs.
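As a minimal sketch of the kind of Monte Carlo simulation such a chapter might contain (the function names and settings here are my own, not from a finished chapter): estimate how many calls it takes to complete a horizontal row on a standard 75-ball card with a free centre square.

```r
set.seed(1)

## One simulated game: how many calls until this card completes a row?
sim_one_game = function(){
  ## Standard card: column k draws 5 numbers from (15k-14):(15k)
  card = sapply(1:5, function(k) sample((15*k - 14):(15*k), 5))
  card[3, 3] = 0           ## free centre square, never called
  marked = matrix(FALSE, 5, 5)
  marked[3, 3] = TRUE      ## centre starts marked

  calls = sample(1:75)     ## random call order
  for(n in seq_along(calls)){
    marked[card == calls[n]] = TRUE
    if(any(rowSums(marked) == 5)) return(n)
  }
  return(75)
}

## Monte Carlo estimate of the expected number of calls,
## plus the empirical CDF of the number of calls
draws = replicate(5000, sim_one_game())
mean(draws)
plot(ecdf(draws), main = "Calls needed to complete a row")
```

Only horizontal rows are checked here for simplicity; extending the win condition to columns and diagonals is exactly the kind of reader modification the book would encourage.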

Tuesday, 26 March 2019

Dataset - The Giant Marmots of Moscow

Stat 403/640/890 Analysis Assignment 3: Polluted Giant Marmots
Due Wednesday, April 3rd
Drop off in the dropbox by the stats workshop, or hand in in class.

For this assignment, use the Marmots_Real.csv dataset.

Main goal: The giant marmots of Moscow have a pollution problem. Find a model to predict the pollutant concentration (mg per kg) in the local population without resorting to measuring it directly. (It turns out that measuring this pollutant requires some invasive measures like looking at bone marrow).

The dataset Marmots_real.csv has the data from 60 such marmots, including many variables that are easier to measure:

Variable types and descriptions:

- Categorical, unordered: one of five species of giant marmot
- Categorical, unordered: one of five regions around Moscow where the subject was captured
- Numerical, continuous: age in years
- Numerical, continuous: longitude, recoded to (0,1000), of capture
- Numerical, continuous: latitude, recoded to (0,1000), of capture
- Numerical, continuous: length nose to tail in cm
- Numerical, continuous: width between front paws, outstretched
- Categorical: M or F
- Numerical, count: number of skin lesions (cuts, open sores) found upon capture
- Binary: 0 or 1; 1 if substantial injury was observed upon capture
- Categorical, ordered: condition of teeth upon capture, listed as Very Bad, Bad, Average, or Good
- Numerical, continuous: mass of subject in 100g
- Numerical, continuous: count of CD4 antibody in blood per mL
- Numerical, continuous: mg/kg of selenium found in bone marrow

There are no sampling weights. There is no missing data. There should be little to no convergence or computational issues with this data.

Assignment parts:
1) Build at least three models for pollutant and compare them (e.g. r-squared, general parsimony). Be sure to try interactions and polynomial terms.

Select one to be your ‘model to beat’.

2) Check the diagnostics of your model to beat. Specifically, normality of residuals, influential outliers, and the Variance Inflation Factor. Comment.

3) Try a transformation of the response in your model to beat, and see if you can improve the r-squared.

4) Try a PCA-based model and see if it comes close to your model to beat.

5) Take your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and use that as a basis for model selection using stepwise and the AIC criterion. Is the stepwise-produced model better (r-squared, AIC) than your ‘model to beat’?

6) If you haven’t already, try a random effect of something appropriate, and see if it beats the AIC of the stepwise model. Use the AIC() function to see the AIC of most models.

Useful sample code:

######## Preamble / Setup
## Load the .csv file into R. Store it as 'dat'
dat = read.csv("marmots_real.csv")
dat$region = as.factor(dat$region)

library(car) # for vif() and boxcox()
library(MASS) # for stepAIC()
library(ks) # for kde()
library(lme4) # lmer and glmer

##### Try some models, with interactions
### Saturated. Not enough DoF
mod = lm(antibody ~ species*region*age*weight + long_cm, data=dat)

### Another possibility, enough DoF, but efficient?
mod = lm(antibody ~ species + region + age*weight*long_cm, data=dat)

### With polynomials
mod = lm(pollutant ~ age + long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)
summary(mod) ## High r-sq and little significance? How?
vif(mod) ## Oh, that's how.

#### Model selection,
mod_full = lm(antibody ~ species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)

### Stepwise selection based on AIC
stepAIC(mod_full)

### What if we do a BIC penalty
## Try without trace=FALSE so we can see what's going on.
stepAIC(mod_full, k=log(nrow(dat)))

######### Transformations
### Start with the classic transforms

### Another possibility, enough DoF, but efficient?
mod = lm(sqrt(pollutant) ~ species + region + age*weight*long_cm, data=dat)

mod = lm(log(pollutant) ~ species + region + age*weight*long_cm, data=dat)

### Box-Cox to find the range of best transformations
boxcox(pollutant ~ species + region + age*weight*long_cm, data = dat,
       lambda = seq(-2, 3, length = 30))
boxcox(antibody ~ species + region + age*weight*long_cm, data = dat,
       lambda = seq(-2, 3, length = 30))
### Anything above the 95% line is perfectly fine. Anything close is probably fine too.
### Reminder:
### Lambda = -1 is the 1/x (inverse) transform
### Lambda = 0 is the log transform
### Lambda = 1/2 is the sqrt transform
### Lambda = 1 is no transform
### Lambda = 2 is the square transform

### First, Are the two responses related?
cor(dat$antibody, dat$pollutant)
plot(dat$antibody, dat$pollutant)

### Start with the simple ANOVAs
mod_anti = lm(antibody ~ species + region + age*weight*lesions, data=dat)
mod_poll = lm(pollutant ~ species + region + age*weight*lesions, data=dat)
aov_anti = anova(mod_anti)

aov_poll = anova(mod_poll)

### Now try the multiple ANOVA
aov_mult = manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions, data=dat)
summary(aov_mult) ### Your job: Make a model that balances simplicity with fit.
                        ## Residual standard errors: Lower = better fit


### convert the relevant categorical variables
dat$teeth_num = as.numeric(factor(x = dat$teeth_condition, levels=c("Very Bad","Bad","Average","Good")))
dat$sex_num = as.numeric(factor(x = dat$sex, levels=c("F","M")))

PCA_all = prcomp( ~ age + weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,
     data = dat,
     scale = TRUE)

plot(PCA_all, type="lines")

### Add the Principal components to the marmots dataset         
dat = cbind(dat, PCA_all$x)

### Try a few models of the responses using the PCAs
mod_PCA1 = lm(antibody ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)

mod_PCA2 = lm(antibody ~ PC1 * PC2 * PC3 , data=dat)

mod_PCA3 = lm(antibody ~ PC1 + PC2 + PC3 , data=dat)

mod_PCA4 = lmer(pollutant ~ PC1 + PC2 + PC3 + (1|region), data=dat)
summary(mod_PCA4)  ### Why non-zero correlations? Adjustments for region were made

### Fixed vs Random

dat$region = as.factor(dat$region)

summary(lm(pollutant ~ region, data=dat))
summary(lmer(pollutant ~ (1|region), data=dat))
summary(lmer(pollutant ~ PC1 + PC2 + (age|region), data=dat))

summary(lmer(pollutant ~  age + (1|region), data=dat))$logLik ### Higher LogLik is better
summary(lmer(pollutant ~  age + (1|region), data=dat))$AICtab ### Lower REML (AIC calculated by REML) is better

### Compare the $AICtab value to the result from stepAIC