Tuesday, 26 March 2019

Dataset - The Giant Marmots of Moscow

Lately, I’ve been having trouble finding just the right dataset for a particularly challenging course I’m teaching. I wanted a dataset in which…

• The main task was inference of regression output
• There were multiple meaningful response variables
•  Complications like sampling weights and missing data were minimal.
• There were variables that could be considered random effects.
• There were enough numerical variables to attempt a Principal Component Analysis
• Interactions, Polynomial terms, and Transformations were meaningful approaches.
• There wasn’t so many variables that this came down to model selection or some big data method.

I couldn’t find one, so I made my own.

I present… the Giant Marmots of Moscow!

Dataset 1 – 100 Values (The ‘test’ set)
Dataset 2 – 60 Values (The ‘training’ set)
Generating code (Embargoed until at least April 4th)

Example analysis assignment – For a grad level statistics service course.

For this assignment, use the Marmots_60.csv dataset.

Main goal: The giant marmots of Moscow have a pollution problem. Find a model to predict the pollutant concentration (mg per kg) in the local population without resorting to measuring it directly. (It turns out that measuring this pollutant requires some invasive measures like looking at bone marrow).

The dataset Marmots_real.csv has the data from 60 such marmots, including many variables that are easier measure:

 Variable Name Type Description Species Categorical, Unordered One of five species of giant marmot Region Categorical, Unordered One of five regions around Moscow where the subject is captured Age Numerical, Continuous Age in years Pos_x Numerical, Continuous Longitude, recoded to (0,1000), of capture Pos_y Numerical, Continuous Latitude, recoded to (0,1000), of capture Long_cm Numerical, Continuous Length nose to tail in cm Wide_cm Numerical, Continuous Width between front paws, outstretched Sex Binary M or F Lesions Numerical, Count Number of skin lesions (cuts, open sores) found upon capture Injured Binary 0 or 1, 1 if substantial injury was observed upon capture. Teeth_Condition Categorical, Ordered Condition of teeth upon capture, listed as Very Bad, Bad, Average, or Good. Weight Numerical, Continuous Mass of subject in 100g Antibody Numerical, Continuous Count of CD4 antibody in blood per mL Pollutant Numerical, Continuous mg/kg of selenium found in bone marrow

There are no sampling weights. There is no missing data. There should be little to no convergence or computational issues with this data.

Assignment parts:
1) Build at least three models for pollutant and compare them (e.g. r-squared, general parsimony). Be sure to try interactions and polynomial terms.

Select one to be your ‘model to beat’.

2) Check the diagnostics of your model to beat. Specifically, normality of residuals, influential outliers, and the Variance Inflation Factor. Comment.

3) Try a transformation of the response in your model to beat, and see if you can improve the r-squared.

4) Try a PCA-based model and see if it comes close to you model to beat.

5) Take your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and use that as a basis for model selection using stepwise and the AIC criterion. Is the stepwise-produced model better (r-squared, AIC) than your ‘model to beat’?

6) If you haven’t already, try a random effect of something appropriate, and see if it beats the AIC of the stepwise model. Use the AIC() function to see the AIC of most models.

Useful sample code:

######## Preamble / Setup
## Load the .csv file into R. Store it as 'dat'
dat\$region = as.factor(dat\$region)

library(car) # for vif() and boxcox()
library(MASS) # for stepAIC()
library(ks) # for kde()
library(lme4) # lmer and glmer

##### Try some models, with interactions
### Saturated. Not enough DoF
mod = lm(antibody ~ species*region*age*weight + long_cm, data=dat)
summary(mod)

### Another possibility, enough DoF, but efficient?
mod = lm(antibody ~ species + region + age*weight*long_cm, data=dat)
summary(mod)
vif(mod)
plot(mod)
AIC(mod)

### With polynomials
mod = lm(pollutant ~ age + long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)
summary(mod) ## High r-sq and little significance? How?
vif(mod) ## Oh, that's how.

#### Model selection,
mod_full = lm(antibody ~ species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)

### Stepwise selection based on AIC
stepAIC(mod_full)

### What if we do a BIC penalty
## Try without trace=FALSE so we can see what's going on.
stepAIC(mod_full, k=log(nrow(dat)))

######### Transformations

### Another possibility, enough DoF, but efficient?
mod = lm(sqrt(pollutant) ~ species + region + age*weight*long_cm, data=dat)
summary(mod)

mod = lm(log(pollutant) ~ species + region + age*weight*long_cm, data=dat)
summary(mod)

### Box-cox to find the range of best ones through
boxcox(pollutant ~ species + region + age*weight*long_cm, data = dat,
lambda = seq(-2, 3, length = 30))

boxcox(antibody ~ species + region + age*weight*long_cm, data = dat,
lambda = seq(-2, 3, length = 30))

### Anything above the 95% line is perfectly fine. Anything close is probably fine too.
### Reminder:
### Lambda = -1 is 1/x (inverse) transform
### lambda = 0 is log tranform
### Lambda = 1/2 is sqrt tranform
### Lambda = 1 is no tranform
### Lambda = 2 is square transform

#################
### MANOVA
### First, Are the two responses related?
cor(dat\$antibody, dat\$pollutant)
plot(dat\$antibody, dat\$pollutant)

mod_anti = lm(antibody ~ species + region + age*weight*lesions, data=dat)
mod_poll = lm(pollutant ~ species + region + age*weight*lesions, data=dat)
aov_anti = anova(mod_anti)
aov_anti
summary(aov_anti)

aov_poll = anova(mod_poll)
aov_poll
summary(aov_poll)

### Now try the multiple ANOVA
aov_mult = manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions)
aov_mult
summary(aov_mult) ### Your job: Make a model that balances simplicity with fit.
## Residual standard errors: Lower = better fit

###################
# PCA

### convert the relevant categorical variables
dat\$sex_num = as.numeric(factor(x = dat\$sex, levels=c("F","M")))

PCA_all = prcomp( ~ age + weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,
data = dat,
scale = TRUE)

summary(PCA_all)
plot(PCA_all, type="lines")

### Add the Principal components to the marmots dataset
dat = cbind(dat, PCA_all\$x)

### Try a few models of the responses using the PCAs
mod_PCA1 = lm(antibody ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)
summary(mod_PCA1)

mod_PCA2 = lm(antibody ~ PC1 * PC2 * PC3 , data=dat)
summary(mod_PCA2)

mod_PCA3 = lm(antibody ~ PC1 + PC2 + PC3 , data=dat)
summary(mod_PCA3)

mod_PCA4 = lme(pollutant ~ PC1 + PC2 + PC3 + (1|region), data=dat)

### Fixed vs Random

marmots\$region = as.factor(marmots\$region)

summary(lm(pollutant ~ region, data=dat))
summary(lmer(pollutant ~ (1|region), data=dat))
summary(lmer(pollutant ~ PC1 + PC2 + (age|region), data=dat))

summary(lmer(pollutant ~  age + (1|region), data=dat))\$logLik ### Higher LogLik is better
summary(lmer(pollutant ~  age + (1|region), data=dat))\$AICtab ### Lower REML (AIC calculated by REML) is better

### Compare the \$AICtab value to the result from stepAIC

 That was a lot of work. Take a break, and enjoy Opening Day!