Saturday, 31 March 2018

Assignments for statistical literacy: Big Data in Healthcare, Data and the Law, Manual Writing

This semester, I've been trying a lot of new assignments to encourage reading and writing of statistical literature as part of a new class and in preparation for a course pack I am publishing soon. 

Here are two of the reading assignments and one of the writing exercises that I tried this semester: "Data and the Law", "Big Data in Healthcare", and an exercise on writing good statistical instructions.

All of the required reading is open access.

1. "Data and the Law" reading assignment, in which we look at some of the complications of copyright, scientific facts, and compilations.

This reading assignment pertains to “Data and the law: Beyond the sweat of the brow. Who owns published data? And what is data?” by Gerald van Belle and Leslie Ruiter, available at

Q1) What are some things that the US Copyright Act denies protection to? Name at least three.

Q2) When was it ruled that telephone numbers were not subject to copyright?

Q3) Say that you wanted to use the information in Table 1 in your own publication. Give two other ways that Table 1 could be changed in order to meet the 'modicum of originality' requirement.

Q4) What are the restrictions, if any, on making a graph using someone else's data?

Q5) Which of the three (EU, USA, and Canada) has the most restrictive copyright law pertaining to data? Which has the least restrictive?

2. "Big Data in Healthcare" reading assignment, in which we look at some of the ways in which big data is changing how hospitals operate.

The following eight questions can be answered by reading the article “Big data analytics in healthcare: promise and potential” by Wullianallur Raghupathi and Viju Raghupathi in Health Information Science and Systems 2014, 2:3

Available at:

Q1) What is more difficult in working with big data in healthcare? Why?

Q2) Give two examples of how big data analytics can lead to improved outcomes.

Q3) What are some developments or outcomes that can be predicted with big data?

Q4) What are the four V's of big data, pertaining to analytics in healthcare?

Q5) Give an example of a future application of real-time data.

Q6) Name three platform/tool options for conducting big data analytics.

Q7) What are the four steps of the methodology of big data analytics?

Q8) How did Twitter tracking compare to official reports of cholera in Haiti in 2010?

3. "Writing Instructional Material" assignment, in which we try to describe the steps for performing a standard analysis.

In-Class Exercise: Write instructions for someone wishing to build a linear regression model and make some additional predictions from that model. Assume that the user has access to software like R, and has experience with it. Also assume that they don't know which statistical tests and plots are useful.

Yours should be about 200 words. The following example is more than 400 words, but shows a lot of what you could run into.

Example: Instructions for performing a one-way ANOVA, including diagnostics and post-hoc analysis.

1. Check if your data is of the proper type. The response/dependent variable should be a numerical quantity, and the explanatory/independent variable should be a categorical or grouping variable. If either of these isn't true, then ANOVA is not the analysis you want.

2. Check the extent of any missing data. If there is some, you may want to remove cases with any missing data, or you may want to impute before continuing. Note that reliable imputation of the group or category may be impossible. Similar considerations may be needed for very small groups.

3. Fit the model with lm() and check its residuals for normality. Do this with a histogram, a Q-Q plot, or a formal hypothesis test like Shapiro-Wilk or Anderson-Darling. If there is strong evidence of non-normality, follow the non-parametric route. Otherwise, follow the parametric route.
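Steps 1 to 3 could be sketched in R roughly as follows. This is only an illustration under my own assumptions: it uses the PlantGrowth data set that ships with R (not data from the assignment), simply drops incomplete cases rather than imputing, and uses the Shapiro-Wilk test as the formal check.

```r
# Illustrative data (an assumption): PlantGrowth ships with R.
# weight is the numeric response, group is the grouping factor.
dat <- PlantGrowth

# Step 1: the response should be numeric and the explanatory variable a factor.
stopifnot(is.numeric(dat$weight), is.factor(dat$group))

# Step 2: count missing values per column and look at the group sizes.
n_missing <- colSums(is.na(dat))
group_sizes <- table(dat$group)
print(n_missing)
print(group_sizes)
dat_complete <- na.omit(dat)  # simplest option: drop incomplete cases

# Step 3: fit the model and check its residuals for normality.
fit <- lm(weight ~ group, data = dat_complete)
res <- residuals(fit)
hist(res)                     # visual check: histogram
qqnorm(res); qqline(res)      # visual check: Q-Q plot
sw <- shapiro.test(res)       # formal check: Shapiro-Wilk test
print(sw$p.value)
```

A small Shapiro-Wilk p-value would point toward the non-parametric route; otherwise the parametric route is reasonable.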

Parametric route

4. Use the lm() command to build a model of 'response ~ explanatory', and save that model. Use the anova() command on the saved model to get an ANOVA table.

5. The p-value on the right of the ANOVA table is the result of testing the hypothesis that all of the group means are equal. If this is small (smaller than some arbitrarily pre-selected alpha, such as 0.05), then continue to the post-hoc test in step 6.

6. Perform a Tukey test with the TukeyHSD() command. Note that TukeyHSD() expects a model fitted with aov() rather than lm(), so refit the same 'response ~ explanatory' formula with aov() first. The difference in each pair of means will be shown in the output. The p-values for group differences are automatically adjusted for the number of pairwise comparisons you have. The null hypothesis being tested in each case is that the two group means in the pair are equal. Any small p-values suggest 'honestly significant' differences between the means.
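A minimal sketch of the parametric route, again assuming the built-in PlantGrowth data and an alpha of 0.05 purely for illustration. One detail worth knowing: TukeyHSD() only has a method for aov objects, so the sketch refits the same formula with aov() for the post-hoc step.

```r
# Illustrative data (an assumption): PlantGrowth ships with R.
dat <- PlantGrowth
alpha <- 0.05  # arbitrarily pre-selected significance level

# Step 4: build and save the model, then get the ANOVA table.
fit <- lm(weight ~ group, data = dat)
tab <- anova(fit)
print(tab)

# Step 5: the overall p-value sits in the "Pr(>F)" column, first row.
p_overall <- tab[["Pr(>F)"]][1]

# Step 6: only proceed to the post-hoc test if the overall test is significant.
if (p_overall < alpha) {
  # TukeyHSD() needs an aov fit, so refit the same formula with aov().
  tukey <- TukeyHSD(aov(weight ~ group, data = dat))
  print(tukey)  # columns: diff, lwr, upr, and the adjusted p-value
}
```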

Non-Parametric route

4. Perform a Kruskal-Wallis test (a non-parametric ANOVA) on the data by using the kruskal.test() command on the 'response ~ explanatory' model. The null hypothesis being tested here is that the mean RANK of each group is the same.

5. The p-value given with the Kruskal-Wallis test indicates if there are any significant differences in the mean RANK of the groups. If this is small (smaller than some arbitrarily pre-selected alpha, such as 0.05), then continue to the post-hoc test in step 6.

6. Perform a Wilcoxon test on each pair of groups and acquire a p-value for each. Compute an adjusted alpha from your initial family-wise alpha and a multiple testing adjustment such as the Bonferroni or Šidák correction. Alternatively, perform a non-parametric post-hoc test such as Dunn's test.
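The non-parametric route might look something like this in R, once more assuming the built-in PlantGrowth data and an alpha of 0.05. Note one swap from the wording above: pairwise.wilcox.test() adjusts the p-values (here with Bonferroni) instead of adjusting alpha, which leads to the same decisions when the adjusted p-values are compared to the original alpha. Dunn's test is not in base R, so it is left out of this sketch.

```r
# Illustrative data (an assumption): PlantGrowth ships with R.
dat <- PlantGrowth
alpha <- 0.05  # arbitrarily pre-selected family-wise significance level

# Steps 4 and 5: Kruskal-Wallis test on the same formula.
kw <- kruskal.test(weight ~ group, data = dat)
print(kw)

# Step 6: pairwise Wilcoxon tests with Bonferroni-adjusted p-values.
if (kw$p.value < alpha) {
  pw <- pairwise.wilcox.test(dat$weight, dat$group,
                             p.adjust.method = "bonferroni",
                             exact = FALSE)  # avoid exact-test warnings with ties
  print(pw$p.value)  # lower-triangular matrix of adjusted p-values
}
```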

My previously blogged reading assignments can be found here:

Designing Surveys:

Model Selection and Missing Data:

Causality and Significance Testing:

The keys to all of these, and to many other reading assignments, will be available starting May 2018 in my e-book "Writing for Statisticians" in the TopHat Marketplace at:

Finally, these posts tend to get more traction on social media if they have images, so here is a picture of my dog.