Lately, I’ve been having trouble finding
just the right dataset for a particularly challenging course I’m teaching. I
wanted a dataset in which…
 The main task was inference of regression output
 There were multiple meaningful response variables
 Complications like sampling weights and missing data were minimal.
 There were variables that could be considered random effects.
 There were enough numerical variables to attempt a Principal Component Analysis
 Interactions, Polynomial terms, and Transformations were meaningful approaches.
 There wasn’t so many variables that this came down to model selection or some big data method.
I couldn’t find one, so I made my own.
I present… the Giant Marmots of Moscow!
Dataset 1 – 100 Values (The ‘test’ set)
Dataset 2 – 60 Values (The ‘training’
set)
Generating code (Embargoed until at least
April 4^{th})
Example analysis assignment – For a grad
level statistics service course.
For this assignment, use the Marmots_60.csv
dataset.
Main goal: The giant marmots of Moscow have a
pollution problem. Find a model to predict the pollutant concentration (mg per
kg) in the local population without resorting to measuring it directly. (It
turns out that measuring this pollutant requires some invasive measures like
looking at bone marrow).
The dataset Marmots_real.csv has the data from
60 such marmots, including many variables that are easier measure:
Variable
Name

Type

Description

Species

Categorical,
Unordered

One of
five species of giant marmot

Region

Categorical,
Unordered

One of
five regions around Moscow where the subject is captured

Age

Numerical,
Continuous

Age in
years

Pos_x

Numerical,
Continuous

Longitude,
recoded to (0,1000), of capture

Pos_y

Numerical,
Continuous

Latitude,
recoded to (0,1000), of capture

Long_cm

Numerical,
Continuous

Length
nose to tail in cm

Wide_cm

Numerical,
Continuous

Width between
front paws, outstretched

Sex

Binary

M or F

Lesions

Numerical,
Count

Number of
skin lesions (cuts, open sores) found upon capture

Injured

Binary

0 or 1, 1
if substantial injury was observed upon capture.

Teeth_Condition

Categorical,
Ordered

Condition
of teeth upon capture, listed as Very Bad, Bad, Average, or Good.

Weight

Numerical,
Continuous

Mass of
subject in 100g

Antibody

Numerical,
Continuous

Count of
CD4 antibody in blood per mL

Pollutant

Numerical,
Continuous

mg/kg of
selenium found in bone marrow

There are
no sampling weights. There is no missing data. There should be little to no
convergence or computational issues with this data.
Assignment
parts:
1) Build
at least three models for pollutant and compare them (e.g. rsquared, general
parsimony). Be sure to try interactions and polynomial terms.
Select one
to be your ‘model to beat’.
2) Check
the diagnostics of your model to beat. Specifically, normality of residuals, influential
outliers, and the Variance Inflation Factor. Comment.
3) Try a
transformation of the response in your model to beat, and see if you can
improve the rsquared.
4) Try a
PCAbased model and see if it comes close to you model to beat.
5) Take
your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and
use that as a basis for model selection using stepwise and the AIC criterion.
Is the stepwiseproduced model better (rsquared, AIC) than your ‘model to
beat’?
6) If you
haven’t already, try a random effect of something appropriate, and see if it
beats the AIC of the stepwise model. Use the AIC() function to see the AIC of
most models.
Useful
sample code:
######## Preamble / Setup
## Load the .csv file into
R. Store it as 'dat'
dat = read.csv("marmots_real.csv")
dat$region = as.factor(dat$region)
library(car) # for vif()
and boxcox()
library(MASS) # for
stepAIC()
library(ks) # for kde()
library(lme4) # lmer and
glmer
##### Try some models,
with interactions
### Saturated. Not enough
DoF
mod = lm(antibody ~
species*region*age*weight + long_cm, data=dat)
summary(mod)
### Another possibility,
enough DoF, but efficient?
mod = lm(antibody ~
species + region + age*weight*long_cm, data=dat)
summary(mod)
vif(mod)
plot(mod)
AIC(mod)
### With polynomials
mod = lm(pollutant ~ age +
long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)
summary(mod) ## High rsq
and little significance? How?
vif(mod) ## Oh, that's
how.
#### Model selection,
mod_full = lm(antibody ~
species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)
### Stepwise selection
based on AIC
stepAIC(mod_full)
### What if we do a BIC
penalty
## Try without trace=FALSE
so we can see what's going on.
stepAIC(mod_full,
k=log(nrow(dat)))
######### Transformations
### Start with the classic
tranforms
### Another possibility,
enough DoF, but efficient?
mod = lm(sqrt(pollutant) ~
species + region + age*weight*long_cm, data=dat)
summary(mod)
mod = lm(log(pollutant) ~
species + region + age*weight*long_cm, data=dat)
summary(mod)
### Boxcox to find the
range of best ones through
boxcox(pollutant ~ species
+ region + age*weight*long_cm, data = dat,
lambda = seq(2, 3, length = 30))
boxcox(antibody ~ species
+ region + age*weight*long_cm, data = dat,
lambda = seq(2, 3, length = 30))
### Anything above the 95%
line is perfectly fine. Anything close is probably fine too.
### Reminder:
### Lambda = 1 is 1/x
(inverse) transform
### lambda = 0 is log
tranform
### Lambda = 1/2 is sqrt
tranform
### Lambda = 1 is no
tranform
### Lambda = 2 is square
transform
#################
### MANOVA
### First, Are the two
responses related?
cor(dat$antibody,
dat$pollutant)
plot(dat$antibody,
dat$pollutant)
### Start with the simple
ANOVAs
mod_anti = lm(antibody ~
species + region + age*weight*lesions, data=dat)
mod_poll = lm(pollutant ~
species + region + age*weight*lesions, data=dat)
aov_anti = anova(mod_anti)
aov_anti
summary(aov_anti)
aov_poll = anova(mod_poll)
aov_poll
summary(aov_poll)
### Now try the multiple
ANOVA
aov_mult =
manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions)
aov_mult
summary(aov_mult) ### Your
job: Make a model that balances simplicity with fit.
## Residual standard errors: Lower = better
fit
###################
# PCA
### convert the relevant
categorical variables
dat$teeth_num =
as.numeric(factor(x = dat$teeth_condition, levels=c("Very
Bad","Bad","Average","Good")))
dat$sex_num =
as.numeric(factor(x = dat$sex, levels=c("F","M")))
PCA_all = prcomp( ~ age +
weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,
data = dat,
scale = TRUE)
summary(PCA_all)
plot(PCA_all,
type="lines")
### Add the Principal
components to the marmots dataset
dat = cbind(dat,
PCA_all$x)
head(dat)
### Try a few models of
the responses using the PCAs
mod_PCA1 = lm(antibody ~
PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)
summary(mod_PCA1)
mod_PCA2 = lm(antibody ~
PC1 * PC2 * PC3 , data=dat)
summary(mod_PCA2)
mod_PCA3 = lm(antibody ~
PC1 + PC2 + PC3 , data=dat)
summary(mod_PCA3)
mod_PCA4 = lme(pollutant ~
PC1 + PC2 + PC3 + (1region), data=dat)
summary(mod_PCA4) ### Why nonzero correlations? Adjustments for
region were made
### Fixed vs Random
marmots$region =
as.factor(marmots$region)
summary(lm(pollutant ~
region, data=dat))
summary(lmer(pollutant ~
(1region), data=dat))
summary(lmer(pollutant ~
PC1 + PC2 + (ageregion), data=dat))
summary(lmer(pollutant
~ age + (1region), data=dat))$logLik
### Higher LogLik is better
summary(lmer(pollutant
~ age + (1region), data=dat))$AICtab
### Lower REML (AIC calculated by REML) is better
### Compare the $AICtab
value to the result from stepAIC
That was a lot of work. Take a break, and enjoy Opening Day! 
No comments:
Post a Comment