The following is a report on the reproduction of the statistical work in the paper "Differences of Type I error rates for ANOVA and Multilevel-Linear-Models using SAS and SPSS for repeated measures designs" by Nicolas Haverkamp and André Beauducel at the University of Bonn.
A blog about teaching and programming statistics, scientific writing, sports analytics, and game design by Dr. Jack Davis.
Wednesday, 14 August 2019
Thursday, 8 August 2019
Reversi in R - Part 2: Graphics and Custom Boards
In this
post, I finish the Reversi / Othello game in R by improving the graphics, adding
the ability to save and load boards, and fixing bugs. Also, many more boards
have been added and tested, including those with unusual shapes, three or more players,
and walls that can make the board into unusual shapes or even break it in half.
Wednesday, 7 August 2019
Reading questions: Struck by Lightning
The book Struck by Lightning, by Jeffrey Rosenthal, hits that balance of scientific correctness and approachability just right for a general-audience book on probability. It’s been in print for 14 years now, and was a Canadian bestseller, so there’s nothing new that would come from a traditional review, and therefore I won’t write one.
Instead, two points:
1. It should be required K-12 or 100-level math/stats reading.
Sunday, 21 July 2019
Reversi in R - Part 1: Bare Bones
In this post, I showcase a bare-bones point-and-click implementation of the classic board game Reversi (also called Othello*) in the R programming language. R is typically used for more serious, statistical endeavors, but it works reasonably well for more playful projects. Building a classic game like this is an excellent high-school-level introduction to programming, as well as a good basis for building and testing game AI.
Tuesday, 2 July 2019
I read this: New Rules for Classic Games
New Rules for Classic Games, by R. Wayne Schmittberger,
written in 1992, is exactly what it sounds like. "New Rules" contains
possible amendments to rules for Risk, Monopoly, Poker, Bridge, Scrabble,
Reversi/Othello*, Shogi, Go, and of course Chess.
Saturday, 22 June 2019
Annual Report to Stakeholders 2018-19
Every year in grad school I had to write a report on my research and academic progress. I found it a useful exercise so I've continued to do so as a faculty member and postdoc.
Summary:
Professionally, this year was a struggle just to keep my head above water. I expect next year to be more productive in general, as well as more research oriented.
Thursday, 20 June 2019
Replication Report - Informative Priors and Bayesian Updating
The following is a report on the reproduction of the statistical work in the paper "The use of informative priors and Bayesian updating: implications for behavioral research" by Charlotte O. Brand et al.
The original paper was accepted for publication by Meta-Psychology, https://open.lnu.se/index.php/metapsychology, a journal focused on methodology and reproductions of existing work.
This report continues my attempt to establish a standard for replication reports according to the Psych Data Standards found here: https://github.com/psychds/psychDS
Tuesday, 28 May 2019
Package Spotlight: anim.plots
The package anim.plots behaves like a sort of user-friendly shell on top of animation that makes animations of some of the most common types of plots in base R in a more intuitive fashion than animation.
This package depends on two other important packages:
 magick, which is an R implementation of ImageMagick, which itself is software used to create animated gifs from still images.
 animation, which is an R library that can be used to create animations from any collection of plots.
Tuesday, 16 April 2019
President's Trophy - A Curse by Design
There are lots of ways the NHL rewards failure and punishes excellence, like the player draft, ever-shrinking salary caps, and the half-win award for participating in overtime, but even the way in which playoff pairings are decided has a perverse incentive.
There are three rewards to doing well in the NHL regular season:
 1. Going to the playoffs,
 2. A favorable first-round pairing in said playoffs,
 3. Home team advantage for playoff games more often.
Here I argue that the first-round pairings are not as favorable as they could be.
Tuesday, 9 April 2019
Natural Language Processing in R: Edit Distance
These are the notes for the second lecture in the unit on text processing. Some useful ideas like exact string matching and the definitions of characters and strings are covered in the notes of Natural Language Processing in R: Strings and Regular Expressions.
Edit distance, also called Levenshtein distance, is a measure of the number of single-character edits (insertions, deletions, and substitutions) that would need to be made to transform one string into another. The R function adist() is used to find the edit distance.
adist("exactly the same","exactly the same") # edit distance 0
adist("exactly the same","totally different") # edit distance 14
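A few more small cases may help make the definition concrete. This is only a sketch using base R's adist(); the example words are made up for illustration and are not from the lecture notes.

```r
# "kitten" -> "sitting": substitute k->s, substitute e->i, insert g
adist("kitten", "sitting")                      # edit distance 3

# Case differences count as substitutions unless ignore.case = TRUE
adist("Stats", "stats")                         # edit distance 1
adist("Stats", "stats", ignore.case = TRUE)     # edit distance 0

# Given a single character vector, adist() returns the full
# pairwise distance matrix
words <- c("color", "colour", "colr")           # hypothetical word list
d <- adist(words)
d
```

The pairwise matrix is the form you would typically feed into a fuzzy-matching step, for example picking the closest dictionary word for each misspelling.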
Natural Language Processing in R: Strings and Regular Expressions.
In this post, I go through a lesson in natural language processing (NLP), in R. Specifically, it covers how strings operate in R, how regular expressions work in the stringr package by Hadley Wickham, and some exercises. Included with the exercises is a list of expected hangups, as well as an R function that can quickly check the solutions.
This lesson is designed for a 1.5-2 hour class for senior undergrads.
Contents:
 Strings in R
 Strings can be stored and manipulated in a vector
 Strings are not factors
 Escape sequences
 The str_match() function
 Regular expressions in R
 Period . means 'any'
 Vertical line | means 'or'
 +, *, and {} define repeats
 ^ and $ mean 'beginning with' and 'ending with'
 [] is a shortcut for 'or'
 hyphens in []
 example: building a regular expression for phone numbers
 Exercises
 Detect email addresses
 Detect a/an errors
 Detect Canadian postal codes
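To give a flavour of how the pieces in the contents above fit together, here is a minimal sketch of a pattern for the postal code exercise. It uses base R's grepl() rather than stringr so that it runs without extra packages; the pattern and the test strings are my own illustration, not the lesson's solution, and it ignores the fact that real postal codes also exclude certain letters.

```r
# ^ and $ anchor the match, [] gives a set of allowed characters,
# hyphens inside [] define ranges, and ? makes the space optional
postal_pattern <- "^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"

# Hypothetical test strings
candidates <- c("V5A 1S6", "v5a1s6", "12345", "V5A-1S6", "V5AA 1S6")

is_postal <- grepl(postal_pattern, candidates)
data.frame(candidates, is_postal)
```

Only the first two candidates should match: the third has no letters, the fourth uses a hyphen where only an optional space is allowed, and the fifth has an extra letter.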
Sunday, 7 April 2019
Writing R documentation, simplified
A massive part of statistical software development is the documentation. Good documentation is more than just a help file; it serves as commentary on how the software works, includes use cases, and cites any relevant sources.
One cool thing about R documentation is that it uses a system that allows it to be put into a variety of different formats while only needing to be written once.
Monday, 1 April 2019
Bingo analysis, a tutorial in R
I'm toying with the idea of writing a book about statistical analyses of classic games. The target audience would be mathematically interested laypeople, much like Jeffrey Rosenthal's book Struck by Lightning ( https://www.amazon.ca/StruckLightningJeffreySRosenthal/dp/0006394957 ).
The twist would be that each chapter would contain step-by-step R code or Python code so that the reader could do the same analysis and make changes based on their own questions. Material would be like this post on Bingo, as well as my previous post on Snakes and Ladders ( https://www.statsetal.com/2017/11/snakesandladdersandtransition.html ).
There would also be some work on chess variants, Othello, poker, and possibly Go, mahjong, and pente. Tied to each analysis could be light lessons on statistics. This Bingo analysis involves Monte Carlo style simulation, as well as notes on computing expected values, CDFs and PDFs.
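As a taste of what Monte Carlo simulation of Bingo can look like, here is a minimal sketch (not the analysis from the full post). It assumes a simplified game: 75 balls are called in random order, and we ask how many calls it takes until one particular set of five numbers has all been called. The five numbers, the seed, and the number of simulations are arbitrary choices for illustration.

```r
set.seed(2019)                      # arbitrary seed, for reproducibility only
n_balls <- 75
wanted  <- c(3, 17, 33, 48, 62)     # hypothetical line of five numbers
n_sims  <- 10000

# For each simulated game, shuffle the balls and record the call on
# which the last of the wanted numbers appears
calls_needed <- replicate(n_sims, {
  draw_order <- sample(n_balls)
  max(match(wanted, draw_order))
})

# Monte Carlo estimate of the expected number of calls
mean(calls_needed)

# Exact expected value for comparison: k(N + 1)/(k + 1)
exact <- length(wanted) * (n_balls + 1) / (length(wanted) + 1)
exact

# Empirical CDF: estimated chance the line is complete within 50 calls
cdf <- ecdf(calls_needed)
cdf(50)
```

The simulated mean should land close to the exact value, and the empirical CDF can be checked against choose(50, 5) / choose(75, 5). Once the simulation agrees with a case you can solve by hand, it is easier to trust it on questions that have no tidy formula.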
Tuesday, 26 March 2019
Dataset - The Giant Marmots of Moscow
Stat 403/640/890 Analysis Assignment 3: Polluted Giant Marmots
Due Wednesday, April 3rd
Drop off in the dropbox by the stats workshop, or hand in in class.
For this assignment, use the Marmots_Real.csv dataset.
Main goal: The giant marmots of Moscow have a pollution problem. Find a model to predict the pollutant concentration (mg per kg) in the local population without resorting to measuring it directly. (It turns out that measuring this pollutant requires some invasive measures like looking at bone marrow).
The dataset Marmots_real.csv has the data from 60 such marmots, including many variables that are easier to measure:
Variable Name | Type | Description
Species | Categorical, Unordered | One of five species of giant marmot
Region | Categorical, Unordered | One of five regions around Moscow where the subject is captured
Age | Numerical, Continuous | Age in years
Pos_x | Numerical, Continuous | Longitude, recoded to (0,1000), of capture
Pos_y | Numerical, Continuous | Latitude, recoded to (0,1000), of capture
Long_cm | Numerical, Continuous | Length nose to tail in cm
Wide_cm | Numerical, Continuous | Width between front paws, outstretched
Sex | Binary | M or F
Lesions | Numerical, Count | Number of skin lesions (cuts, open sores) found upon capture
Injured | Binary | 0 or 1, 1 if substantial injury was observed upon capture
Teeth_Condition | Categorical, Ordered | Condition of teeth upon capture, listed as Very Bad, Bad, Average, or Good
Weight | Numerical, Continuous | Mass of subject in 100g
Antibody | Numerical, Continuous | Count of CD4 antibody in blood per mL
Pollutant | Numerical, Continuous | mg/kg of selenium found in bone marrow
There are no sampling weights. There is no missing data. There should be little to no convergence or computational issues with this data.
Assignment parts:
1) Build at least three models for pollutant and compare them (e.g. R-squared, general parsimony). Be sure to try interactions and polynomial terms.
Select one to be your ‘model to beat’.
2) Check the diagnostics of your model to beat. Specifically, normality of residuals, influential outliers, and the Variance Inflation Factor. Comment.
3) Try a transformation of the response in your model to beat, and see if you can improve the R-squared.
4) Try a PCA-based model and see if it comes close to your model to beat.
5) Take your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and use that as a basis for model selection using stepwise and the AIC criterion. Is the stepwise-produced model better (R-squared, AIC) than your ‘model to beat’?
6) If you haven’t already, try a random effect of something appropriate, and see if it beats the AIC of the stepwise model. Use the AIC() function to see the AIC of most models.
Useful sample code:
######## Preamble / Setup
## Load the .csv file into R. Store it as 'dat'
dat = read.csv("marmots_real.csv")
dat$region = as.factor(dat$region)
library(car) # for vif()
library(MASS) # for stepAIC() and boxcox()
library(ks) # for kde()
library(lme4) # lmer and glmer
##### Try some models, with interactions
### Saturated. Not enough DoF
mod = lm(antibody ~ species*region*age*weight + long_cm, data=dat)
summary(mod)
### Another possibility, enough DoF, but efficient?
mod = lm(antibody ~ species + region + age*weight*long_cm, data=dat)
summary(mod)
vif(mod)
plot(mod)
AIC(mod)
### With polynomials
mod = lm(pollutant ~ age + long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)
summary(mod) ## High rsq and little significance? How?
vif(mod) ## Oh, that's how.
#### Model selection,
mod_full = lm(antibody ~ species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)
### Stepwise selection based on AIC
stepAIC(mod_full)
### What if we do a BIC penalty
## Try without trace=FALSE so we can see what's going on.
stepAIC(mod_full, k=log(nrow(dat)))
######### Transformations
### Start with the classic transforms
mod = lm(sqrt(pollutant) ~ species + region + age*weight*long_cm, data=dat)
summary(mod)
mod = lm(log(pollutant) ~ species + region + age*weight*long_cm, data=dat)
summary(mod)
### Box-Cox to find the range of the best transformations
boxcox(pollutant ~ species + region + age*weight*long_cm, data = dat,
lambda = seq(-2, 3, length = 30))
boxcox(antibody ~ species + region + age*weight*long_cm, data = dat,
lambda = seq(-2, 3, length = 30))
### Anything above the 95% line is perfectly fine. Anything close is probably fine too.
### Reminder:
### Lambda = -1 is 1/x (inverse) transform
### Lambda = 0 is log transform
### Lambda = 1/2 is sqrt transform
### Lambda = 1 is no transform
### Lambda = 2 is square transform
#################
### MANOVA
### First, Are the two responses related?
cor(dat$antibody, dat$pollutant)
plot(dat$antibody, dat$pollutant)
### Start with the simple ANOVAs
mod_anti = lm(antibody ~ species + region + age*weight*lesions, data=dat)
mod_poll = lm(pollutant ~ species + region + age*weight*lesions, data=dat)
aov_anti = anova(mod_anti)
aov_anti
summary(aov_anti)
aov_poll = anova(mod_poll)
aov_poll
summary(aov_poll)
### Now try the multiple ANOVA
aov_mult = manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions, data=dat)
aov_mult
summary(aov_mult)
### Your job: Make a model that balances simplicity with fit.
## Residual standard errors: Lower = better fit
###################
# PCA
### convert the relevant categorical variables
dat$teeth_num = as.numeric(factor(x = dat$teeth_condition, levels=c("Very Bad","Bad","Average","Good")))
dat$sex_num = as.numeric(factor(x = dat$sex, levels=c("F","M")))
PCA_all = prcomp( ~ age + weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,
data = dat,
scale = TRUE)
summary(PCA_all)
plot(PCA_all, type="lines")
### Add the Principal components to the marmots dataset
dat = cbind(dat, PCA_all$x)
head(dat)
### Try a few models of the responses using the PCAs
mod_PCA1 = lm(antibody ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)
summary(mod_PCA1)
mod_PCA2 = lm(antibody ~ PC1 * PC2 * PC3 , data=dat)
summary(mod_PCA2)
mod_PCA3 = lm(antibody ~ PC1 + PC2 + PC3 , data=dat)
summary(mod_PCA3)
mod_PCA4 = lmer(pollutant ~ PC1 + PC2 + PC3 + (1|region), data=dat)
summary(mod_PCA4) ### Why nonzero correlations? Adjustments for region were made
### Fixed vs Random
dat$region = as.factor(dat$region)
summary(lm(pollutant ~ region, data=dat))
summary(lmer(pollutant ~ (1|region), data=dat))
summary(lmer(pollutant ~ PC1 + PC2 + (age|region), data=dat))
summary(lmer(pollutant ~ age + (1|region), data=dat))$logLik ### Higher LogLik is better
summary(lmer(pollutant ~ age + (1|region), data=dat))$AICtab ### Lower REML (AIC calculated by REML) is better
### Compare the $AICtab value to the result from stepAIC
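The Box-Cox reminder in the sample code can be made concrete with a small sketch of the transformation family itself. The function name box_cox and the toy vector y are assumptions for illustration; they are not part of the assignment or of the MASS package. In this family, lambda = -1 corresponds to the inverse, 0 to the log, 1/2 to the square root, 1 to no transformation (a shift only), and 2 to the square.

```r
# Box-Cox family: (y^lambda - 1) / lambda, with log(y) as the
# limiting case when lambda = 0. Only defined for positive y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 2, 4, 8)      # toy positive response, not the marmot data

box_cox(y, -1)          # 1 - 1/y: the inverse transform, rescaled
box_cox(y, 0)           # log(y)
box_cox(y, 0.5)         # 2 * (sqrt(y) - 1): the sqrt transform, rescaled
box_cox(y, 1)           # y - 1: no real transform, just a shift
box_cox(y, 2)           # (y^2 - 1) / 2: the square transform, rescaled
```

The shifting and rescaling do not affect R-squared or the residual diagnostics of a linear model, which is why the plain sqrt() and log() versions in the sample code are equivalent for model comparison.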
Saturday, 23 March 2019
What makes a good data dictionary?
A data dictionary is a guide, external to the dataset in question, that explains what each variable is in a human-readable format. In R, the programming equivalent of a data dictionary is the result of the str() function, which will show the first few values of each variable, the format (e.g. numerical, string, factor), and other important information (e.g. the first few levels of the factor).
A data dictionary may include software-specific features, but it should still be of value to anyone using the dataset, regardless of the software they are using.
Other than that, what makes a good data dictionary?
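As a starting point, the information that str() prints can be collected into a table that is easy to export and annotate by hand. The sketch below is one possible approach, not a standard tool: the function name make_dictionary and the toy data frame are hypothetical, and a real dictionary would still need human-written descriptions and units added.

```r
# Build a skeleton data dictionary: one row per variable
make_dictionary <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1),
                       USE.NAMES = FALSE),
    n_unique  = vapply(df, function(x) length(unique(x)), integer(1),
                       USE.NAMES = FALSE),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1),
                       USE.NAMES = FALSE),
    example   = vapply(df, function(x) as.character(x[1]), character(1),
                       USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy dataset for illustration only
toy <- data.frame(
  species = factor(c("A", "B", "A")),
  age     = c(2.5, 4.1, NA),
  sex     = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

dict <- make_dictionary(toy)
dict
```

Writing dict out with write.csv() gives a file that can be opened in any software, which fits the idea that a dictionary should be useful regardless of the tools the reader has.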
Thursday, 21 February 2019
Reading Assignments - Split-Plot Design, Magnitude-Based Inference
This semester, I'm teaching a new (to me) course that's heavy into design of experiments and biostatistics, which means I needed some new reading assignments. First, a survey of applications of split-plot designs for fisheries. Next, a seminal paper on magnitude-based inference, written for physiologists. Non-paywalled links to the papers included.
Sunday, 17 February 2019
Alternatives to the P value
Here are some things you can find and report as alternatives to the p-value: confidence intervals, Bayes factors, and magnitude-based inference. These are used in mostly the same situations, but all of them are more informative, especially in combination with the p-value.
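For the first of these alternatives, base R already reports a confidence interval alongside the p-value. The sketch below uses simulated data (the sample size, mean, and seed are arbitrary assumptions) to show the two side by side.

```r
set.seed(1)                           # arbitrary seed
x <- rnorm(30, mean = 0.4, sd = 1)    # simulated measurements, not real data

res <- t.test(x, mu = 0)              # one-sample t-test against zero

res$p.value       # only says how surprising the data are if the mean is 0
res$estimate      # the sample mean
res$conf.int      # a 95% interval of plausible values for the mean

# The two agree on "significance": p < 0.05 exactly when 0 lies
# outside the 95% interval
ci <- res$conf.int
zero_outside <- (0 < ci[1]) || (0 > ci[2])
(res$p.value < 0.05) == zero_outside
```

The interval carries the same yes/no decision as the p-value, but also shows the size of the effect and how precisely it has been estimated.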
Tuesday, 12 February 2019
Lingering questions from the 2018 MLB season
Here are a few more comments and ongoing questions about Major League Baseball that I wanted to post but that didn't fit anywhere else. Just a few more weeks until spring training!
Inside: Pitch count superstitions, base coach evaluation, WAR in blowouts, and anecdotes of SafeCo.
The Vestigial Reference Letter
The graduate school reference letter is a holdover from a time when academia was much more of a closed-off clique than the relatively open network of today. Back then, references served to keep the world of higher education reserved for insiders. Nowadays, these letters are vestigial and purposeless; they only serve to waste the time of applicants, referees, and admissions people.
Friday, 8 February 2019
Replication Report - Signal Detection Analysis
The following is a report on the reproduction of the statistical work in the paper “Insights into Criteria for Statistical Significance from Signal Detection Analysis” by Jessica K. Witt at Colorado State University.
The original paper was accepted for publication by the Journal of Meta-Psychology, a journal focused on methodology and reproductions of existing work. This report is my first attempt at establishing a standard template for future such reports according to the Psych Data Standards found here: https://github.com/psychds/psychDS
Saturday, 26 January 2019
I read this: Smart Baseball
Smart Baseball, by Keith Law, exposes many common baseball statistics like batting average for the garbage they are. He does this in an approachable, narrative style that will appeal to longtime baseball fans. It's very light on statistics, but Law obviously knows his stuff; he just isn't interested in showing off the details.
Friday, 25 January 2019
Venn Diagram Stats Memes
Statistics is great, but it doesn't have the best meme portfolio, at least not since Biostatistics Ryan Gosling.
Let's try to fix that.
More after the break.
Sunday, 13 January 2019
Statistical Vocabulary Crossword Puzzles
I've expanded the statistical thesaurus again, as well as made some crossword puzzles from them, for vocabulary study.