## Friday, 4 March 2016

### Reading Assignments - Model Selection and Missing Data

These are two more readings that were incorporated into a third-year stats course geared towards life and health sciences. One is a model selection paper geared towards ecologists, and the other is a paper on missing data and imputation in the context of medicine and survival analysis.

#### Model Selection

The first reading is Model Selection in Ecology and Evolution, by Jerald Johnson and Kristian Omland. It focuses on principles and sticks to hard science without getting too mathematical. Better yet, complexities and applications are set aside in shaded boxes.

I'm hoping this is the 'Goldilocks' assignment in terms of difficulty, where the first was too easy and the second too hard.

R1, 2 pts) From the abstract (the first paragraph in bold), how is model selection used?

R2, 2 pts) What does model selection offer in contrast to a single null hypothesis test?

R3, 3 pts) What are three primary advantages of model selection?

R4, 2 pts) Name two commonly used criteria for model selection.

R5, 3 pts) Name a method to address the problem of several models all being equally (or nearly equally) viable, in other words, when more than one model has equal support from the data. What are two advantages of using this method?

R6, 3 pts) What is a more recent application of model selection in evolutionary biology? What about model selection makes it well suited to this application?

R7, 2 pts) The authors suggest a requirement for the model being selected, in order to ensure that the parameter estimates are biologically plausible. Describe this requirement.

For interest only, 0 pts) What framework does model selection offer to ecosystem science?

-------------------

#### Missing Data, Imputation, and Survival Analysis

The next reading is the first two sections of Multiple Imputation of Missing Blood Pressure in Survival Analysis, by Stef van Buuren.

Stef van Buuren is the author of the mice package and its accompanying textbook. This paper was selected for the same reason as the Rubin paper: An introductory explanation by a big name.

This is the lightest of the four readings, with barely four pages of required material and a set of simple questions. However, going by previous syllabi, missing data is typically barely touched in this class. I didn't want to introduce more depth than usual, as my personal touch on this course is too heavy already.

Imputation is the term used for any method that fills in missing data. Multiple imputation refers to filling in each missing data point with several plausible values.
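To make the distinction concrete, here is a minimal Python sketch of the idea, not of the algorithm used by the mice package. It assumes a deliberately naive imputation rule (each missing value is replaced by a random draw from the observed values), and the blood pressure readings and the function names `impute_once` and `multiply_impute` are made up for illustration.

```python
import random
import statistics


def impute_once(values, rng):
    """Return a copy of values with each None replaced by a randomly
    chosen observed value (a naive rule, assumed for illustration)."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def multiply_impute(values, m=5, seed=0):
    """Create m completed copies of the data, each with its own random fill-ins."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]


# Hypothetical systolic blood pressure readings; None marks a missing value.
bp = [120, None, 135, 110, None, 142, 128]

completed = multiply_impute(bp, m=5)

# Analyse each completed dataset separately, then pool the estimates.
estimates = [statistics.mean(dataset) for dataset in completed]
pooled_mean = statistics.mean(estimates)
```

Each of the five completed datasets fills the gaps differently, and the spread of the five estimates reflects the uncertainty due to the missing values. A single imputation would hide that uncertainty.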

R1. Write the regression equations for models A and B, assuming that 'mortality', 'age' and 'health' are continuous variables, and that 'sex' is categorical.

R2. What is different about the 12.5% of cases in which blood pressure data is missing? Why is this a problem?

R3. What assumption do all methods of dealing with incomplete covariates rely on?

R4. What are the names of the four mechanisms for missing/incomplete data? Which one is the simplest? Which one is the most problematic?

R5. Describe two of the trends from Table 1 relating the missingness of blood pressure (BP) measurements to other known variables.

R6. Which missing data mechanism would you expect in each of the following scenarios?

R6a) Taking water samples, but with equipment that fails 10% of the time simply because it is old.

R6b) A poll of randomly selected people that asks a question based on sensitive or illegal information like 'do you use illicit drugs?'

R6c) A poll of people that have volunteered to be in the sample.

---------------------------

Here are the first two reading assignments, on causality and hypothesis testing, respectively.