Friday 27 July 2018

Stat Writing Exercise - Pre-Baked Regression Analysis

In this Statistical Communication exercise, the learners take an already completed regression analysis and write a report of 250-400 words describing the analysis. This exercise consists of a 40-50 minute example that the teacher goes through to demonstrate and establish expectations, followed by a 50-70 minute period for the learners to emulate that writing process on a new analysis.



The example analysis is a logistic regression model selected through a stepwise process. To save time, the teacher can skip the stepwise process, along with any explanation of variables not used in the final model. The data is from UCI's Breast Cancer dataset, which is documented here:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.names


The graphs and R output from the analysis are as follows:

Figure list:
### Figure 1: First six rows of the dataset
head(dat)

### Figure 2: Summary statistics of the dataset
summary(dat)

### Figure 3: Summary of full model
mod_full <- glm(Class ~ ., family="binomial", data=dat)
summary(mod_full)

### Figure 4: Odds ratios of full model
round(exp(coef(mod_full)),3)


### Figure 5: Summary of stepwise fitted model
library(MASS)   # provides stepAIC()
mod_step <- stepAIC(mod_full, direction="both")
summary(mod_step)

### Figure 6: Odds ratios of stepwise model
round(exp(coef(mod_step)),3)

### Figure 7: Normal Quantile-Quantile Plot
### Figure 8: Leverage Plot
plot(mod_step, which=c(2,5))   # 2 = normal Q-Q, 5 = residuals vs leverage











Use the following checklist as a template.
1. What is the purpose of the analysis?
Specifically: What is the response variable, and what are we trying to do with it?
In this example: Trying to classify recurrence or non-recurrence of cancer using the other variables available. Since the response is binary, we will use logistic regression.

2. What is the relevant data available for this analysis?
Specifically, how many cases do we have? What format are the variables in? What are the most common responses and what are their distributions?
In this example:  Recurrence within five years (categorical, binary, response), Age (categorized by decade), Pre or Post Menopause, etc.

3. What sort of data preparation was done?
Specifically, what was done to the data in order to make the analysis possible?
In this example: We removed cases for which some of the data was missing, and cases that were in rare categories, such as age < 20 and age > 70.

4. What is the model that was used?
In this example: A logistic model of three variables.

5. How did you come up with this model?
In this example: With stepwise regression, working in both directions, optimizing on AIC.

6. What are some key features of the model?
Specifically: Draw by hand a mock-up of what a table would look like when describing the model.
What are the significant variables? What does their effect size mean? Use the general-example-exception principle to tell a story about a typical case or a typical effect size.

7. How well does this model work?
Specifically: Use the summary information and the diagnostic plots to explain how well the model assumptions fit and how well the model performs.


Example written as demonstration


1. What is the purpose of the analysis?
To predict whether there is a recurrence of cancer (e.g. in the next five years).
The response is binary, so this is a classification problem and we use logistic regression.

2. What is the relevant data available for this analysis?


We have a dataset from UCI with 10 variables describing the breast cancer history of 256 women. These variables include

‘Class’, a binary response variable indicating recurrence of cancer within 5 years,
‘Age’ as a categorical variable (10-19, 20-29, 30-39, … , 90-99),
menopausal status (binary),
tumor size in mm (categorical 0-4, 5-9, … , 50-54),
number of tumor nodes (categorical 0-2, 3-5, 6-8, …),
whether the nodes are capped (binary),
degree of malignancy (categorical 1,2,3),
breast (binary),
quadrant (categorical, 5 categories),
radiation therapy used (binary).
 

3. What sort of data preparation was done?

Before analysis, we removed cases with missing data and cases in rare categories (e.g. ages less than 20 or more than 70, menopause before 40).
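The preparation step described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy data frame, not the code used for the actual analysis: the column names, the values, the "?" missing-value marker, and the `min_count` threshold are all assumptions chosen for the demonstration.

```r
# Toy stand-in for the real data; column names and values are made up.
dat_toy <- data.frame(
  Class = c("no", "yes", "no", "no", "yes", "no", "yes", "no"),
  age   = c("40-49", "50-59", "20-29", "40-49", "50-59", "40-49", "50-59", "40-49"),
  caps  = c("no", "?", "no", "yes", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Assumption: missing values are coded as "?". Recode them as NA,
# then keep only the complete cases.
dat_toy[dat_toy == "?"] <- NA
dat_toy <- dat_toy[complete.cases(dat_toy), ]

# Drop cases whose age category appears fewer than min_count times.
min_count   <- 2   # hypothetical threshold for a "rare" category
age_counts  <- table(dat_toy$age)
keep_levels <- names(age_counts)[age_counts >= min_count]
dat_toy     <- dat_toy[dat_toy$age %in% keep_levels, ]

# Convert the remaining text columns to factors for modelling.
dat_toy[] <- lapply(dat_toy, factor)
summary(dat_toy)
```

Reporting the number of cases before and after each filter gives the reader of the report a sense of how much data the preparation cost.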

4. What is the model that was used?
We attempted a full model, using every explanatory variable in the dataset. This model, however, was very difficult to interpret and of little use to clinics and hospitals. Only one parameter (malignancy 3 vs. malignancy 1) was statistically significant. Due to these difficulties we opted for a simpler model instead:
Log-Odds (Class) as a function of (number of nodes), (malignancy), and (radiation usage).

5. How did you come up with this model?
We started with a full model of all 9 variables without any interactions, and we used stepwise variable selection optimized on the Akaike Information Criterion to come up with a simpler model.
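For readers who want to see the mechanics of this kind of selection, here is a small sketch on simulated data. It uses `step()` from base R rather than `stepAIC()` from MASS (both search on AIC), and the data frame `sim`, its variables, and the coefficients used to simulate it are invented for illustration only.

```r
# Simulated stand-in data: only x1 truly affects the response.
set.seed(1)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

# Full model with every explanatory variable, no interactions.
mod_full_sim <- glm(y ~ ., family = "binomial", data = sim)

# Stepwise search in both directions, optimizing AIC; trace = 0 hides the log.
mod_step_sim <- step(mod_full_sim, direction = "both", trace = 0)

# Compare the selected formula and the AIC of both models.
formula(mod_step_sim)
c(full = AIC(mod_full_sim), stepwise = AIC(mod_step_sim))
```

The stepwise model can never have a higher AIC than the model it started from, which is a useful sanity check when describing the selection in a report.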

6. What are some key features of the model?
 
Residual deviance is 263 compared to a null deviance of 310, which equates roughly to a pseudo-R-squared of 1 - (263/310) = 0.15. So this model does not predict very well whether a recurrence happens or not. It should be noted that even the full model only has a pseudo-R-squared of about 0.20.
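The arithmetic behind that figure can be reproduced in a couple of lines. The two deviances are the ones quoted above; the helper `deviance_r2()` is a hypothetical convenience function, not something from the original analysis.

```r
# Deviances as quoted in the text for the stepwise model.
null_deviance     <- 310
residual_deviance <- 263

# Deviance-based pseudo-R-squared: share of null deviance explained.
pseudo_r2 <- 1 - residual_deviance / null_deviance
round(pseudo_r2, 2)   # 0.15

# Hypothetical helper doing the same for any fitted glm object.
deviance_r2 <- function(model) 1 - model$deviance / model$null.deviance
```

With the fitted model at hand, `deviance_r2(mod_step)` would give the same quantity without copying numbers from the summary output.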


However, we do have some useful indicators. The number of nodes matters substantially.

The odds of recurrence are 2.74 (CI: 2-4) and 2.90 (CI: 2.1-4.2) times as high for women with 3-5 and 6-8 nodes respectively, when compared to those having 0-2 nodes, holding other variables constant. When there are 9 or more nodes, the odds of recurrence are 6.42 (CI: 5-10) times as high.

The odds of recurrence for malignancy 1 and 2 are about the same, but the odds are about 4 times as high for those with stage 3 cancer (malignancy 3).
The effect of radiation on the odds of recurrence is unclear due to a large standard error.
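One common way to get odds ratios with confidence intervals from a fitted logistic model is to exponentiate the coefficients and their Wald intervals. The sketch below does this on simulated data; the helper `odds_ratio_table()`, the variable names, and the simulated effect sizes are assumptions for illustration, and the intervals quoted above were not necessarily computed this way.

```r
# Hypothetical helper (not from the original analysis): odds ratios with
# Wald confidence intervals for any fitted glm.
odds_ratio_table <- function(model, level = 0.95) {
  est <- coef(model)
  ci  <- confint.default(model, level = level)  # Wald interval, log-odds scale
  round(exp(cbind(OR = est, lower = ci[, 1], upper = ci[, 2])), 2)
}

# Simulated stand-in data: a three-level factor loosely mimicking node counts.
set.seed(2)
nodes <- factor(sample(c("0-2", "3-5", "6-8"), 300, replace = TRUE),
                levels = c("0-2", "3-5", "6-8"))
log_odds <- -1 + 1.0 * (nodes == "3-5") + 1.8 * (nodes == "6-8")
recur    <- rbinom(300, size = 1, prob = plogis(log_odds))

mod_sim  <- glm(recur ~ nodes, family = "binomial")
or_table <- odds_ratio_table(mod_sim)
or_table
```

Each non-intercept row compares one category to the baseline (here "0-2"); the intercept row is the baseline odds rather than a ratio, so it is usually left out of the written report.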


7. How well does this model work?
 
A normal quantile-quantile plot reveals that there is a major break from normality in the residuals. We are not concerned about this because of the binary nature of the responses. Furthermore, we would expect leaps in the Q-Q plot because every variable we used is categorical, so a smooth progression is nearly impossible.

A leverage plot does not reveal any outliers or overly leveraged points. There are two notable cases with enough leverage to be potentially influential on the model; however, neither of these deviates from the model as a whole.



Exercise Portion


Comment: The 'exercise' dataset is the 'trees' data from the datasets package in base R. This analysis is simpler than the example; it's a linear regression rather than a logistic one, and the model is pre-selected rather than selected through a stepwise process. There are still some twists: specifically, the 'species' category is meaningless (this is mentioned in the documentation), the model includes a polynomial term, and although the model fits reasonably well, there will be some diagnostic issues because the model is misapplied - volume should scale with height TIMES girth-squared, not height PLUS girth-squared.
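For the teacher's reference, the multiplicative relationship mentioned in the comment can be checked with a log-log model, since taking logs turns height TIMES girth-squared into a sum. This is a sketch of an alternative model, not part of the exercise handout; the name `mod_log` is made up, and the made-up species column is left out.

```r
# Multiplicative alternative: Volume proportional to Height * Girth^2,
# which becomes additive on the log scale:
#   log(Volume) = a + b * log(Girth) + c * log(Height)
mod_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
summary(mod_log)

# If the multiplicative form holds, b should be near 2 and c near 1.
round(coef(mod_log), 2)

# Residuals vs fitted values, for comparison with the additive model.
plot(resid(mod_log) ~ fitted(mod_log))
```

Comparing this residual plot with the one from the additive polynomial model makes the diagnostic issue easier to point out during discussion.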



Dataset information


This data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground.

A data frame with 31 observations on 3 variables.

[,1]  Girth    numeric      Tree diameter in inches (1 inch = 2.5 cm)
[,2]  Height   numeric      Height in ft (1 foot = 12 inches = 30 cm)
[,3]  Volume   numeric      Volume of timber in cubic ft
[,4]  Species  categorical  Made up entirely

Source
Ryan, T. A., Joiner, B. L. and Ryan, B. F. (1976) The Minitab Student Handbook. Duxbury Press.

Figure list:
### Figure 1: Raw Data of trees
trees

### Figure 2: Summary information
summary(trees)

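### Figure 3: Summary of polynomial model
### (Species is the made-up column added to trees; I() is needed so that ^2 squares Girth inside a formula)
mod_poly = lm(Volume ~ Species + Girth + I(Girth^2) + Height, data=trees)
summary(mod_poly)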


### Figure 4: Predicted vs Actual
plot(predict(mod_poly) ~ trees$Volume)


### Figure 5: Predicted vs Residual
plot(resid(mod_poly) ~ predict(mod_poly))


### Figure 6: Normal Quantile-Quantile Plot
### Figure 7: Leverage Plot
plot(mod_poly, which=c(2,5))   # 2 = normal Q-Q, 5 = residuals vs leverage









Checklist


1. What is the purpose of the analysis?
Specifically: What is the response variable, and what are we trying to do with it?

2. What is the relevant data available for this analysis?
Specifically, how many cases do we have? What format are the variables in? What are the most common responses and what are their distributions?
 
3. What is the model that was used?

4. What are some key features of the model?
Specifically: Draw by hand a mock-up of what a table would look like when describing the model.

What are the significant variables? What does their effect size mean? Use the general-example-exception principle to tell a story about a typical case or a typical effect size.

5. How well does this model work?
