This semester, I've been trying a lot of new assignments to encourage reading and writing of statistical literature as part of a new class and in preparation for a course pack I am publishing soon.
Here are two of the reading assignments and one of the writing exercises that I tried this semester: "Data and the Law", "Big Data in Healthcare", and an exercise on writing good statistical instructions.
All of the required reading is open access.
1. "Data and the Law" reading assignment, in which we look at some of the complications of copyright, scientific facts, and compilations.
This
reading assignment pertains to “Data and the law: Beyond the sweat
of the brow. Who owns published data? And what is data?” by Gerald
van Belle and Leslie Ruiter, available at
https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2014.00737.x
Q1)
What are some things that the US Copyright Act denies protection to?
Name at least three.
Q2)
When was it ruled that telephone numbers were not subject to
copyright?
Q3)
Say that you wanted to use the information in Table 1 in your own
publication. Give two other ways that Table 1 could be changed in
order to meet the 'modicum of originality' requirement.
Q4)
What are the restrictions, if any, on making a graph using someone
else's data?
Q5)
Which of the three jurisdictions (EU, USA, and Canada) is the most
restrictive in its copyright law pertaining to data? Which is the least?
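2. "Big Data in Healthcare" reading assignment, in which we look at the promise and potential of big data analytics in healthcare.
The following 8 questions can be answered by reading the article “Big data analytics in healthcare: promise and potential” by Wullianallur Raghupathi and Viju Raghupathi in Health Information Science and Systems 2014, 2:3.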
Available at: http://www.hissjournal.com/content/2/1/3
Q1) What is more difficult in working with big data in healthcare? Why?
Q2) Give two examples of how big data analytics can lead to improved outcomes.
Q3) What are some developments or outcomes that can be predicted with big data?
Q4) What are the four V's of big data, pertaining to analytics in healthcare?
Q5) Give an example of a future application of real-time data.
Q6) Name three platform/tool options for conducting big data analytics.
Q7) What are the four steps of the methodology of big data analytics?
Q8) How did Twitter tracking compare to official reports of cholera in Haiti in 2010?
3. "Writing Instructional Material" assignment, in which we try to describe the steps for performing a standard analysis.
In-Class
Exercise: Write instructions for someone wishing to make a linear
regression model and make some additional predictions from that model.
Assume that the user has access to software like R, and has
experience with it. Also assume that they don't know which statistical
tests and plots are useful.
Yours should be about 200 words. The
following example is more than 400 words, but shows a lot of what you could run
into.
Example: Instructions for
performing a one-way ANOVA, including diagnostics and post-hoc
analysis.
1. Check if your data is of the proper
type. The response/dependent variable should be a numerical quantity,
and the explanatory/independent variable should be a categorical or
grouping variable. If one of these isn't true, then ANOVA is not the
analysis you want.
2. Check the extent of missing
data, if any. If there is some, you may want to remove cases with any
missing data, or you may want to impute before continuing. Note that
reliable imputation of the group or category may be impossible.
Similar considerations may be needed for very small groups.
3. Check the residuals of the model
from lm() for normality. Do this with a histogram, a Q-Q plot, or a
formal hypothesis test like Shapiro-Wilk or Anderson-Darling. If
there is strong evidence of non-normality, follow the non-parametric
route. Otherwise, follow the parametric route.
Parametric route
4. Use the lm() command to build a
model of 'response ~ explanatory', and save that model. Use the
anova() command on the saved model to get an anova table.
5. The p-value on the right of the
ANOVA table is the result of testing the hypothesis that all of the
group means are equal. If this is small (smaller than some arbitrarily
pre-selected alpha, such as 0.05), then continue to the post-hoc test
in step 6.
6. Perform a Tukey test with the
TukeyHSD() command. Note that TukeyHSD() takes a model fitted with
aov() rather than lm(), so refit the same 'response ~ explanatory'
formula with aov() first. The differences in each pair of means will
be shown in the output. The p-values for group differences are
automatically adjusted for the number of groups and pairwise
comparisons you have. The null hypothesis being tested in each case
is that the two group means in the pair are equal. Any small p-values
suggest 'honestly significant' differences between the means.
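As a rough sketch, the checks in steps 1 to 3 and the parametric route might look like this in R. The built-in PlantGrowth data set (weight by group) is only a stand-in and is an assumption, not part of the example; swap in your own response and explanatory variables.

```r
# Assumption: PlantGrowth stands in for your data.
# weight = numeric response, group = categorical explanatory variable.
dat <- PlantGrowth

# Step 1: confirm the variable types.
stopifnot(is.numeric(dat$weight), is.factor(dat$group))

# Step 2: count missing values per column and look at the group sizes.
colSums(is.na(dat))
table(dat$group)

# Step 3: fit the model and check its residuals for normality.
fit <- lm(weight ~ group, data = dat)
res <- residuals(fit)
hist(res)
qqnorm(res)
qqline(res)
shapiro.test(res)

# Steps 4 and 5: ANOVA table; the p-value is in the "Pr(>F)" column.
anova(fit)

# Step 6: TukeyHSD() needs an aov fit, so refit the same formula with aov().
tukey <- TukeyHSD(aov(weight ~ group, data = dat))
tukey
```

The adjusted p-values for each pair of groups are in the "p adj" column of the Tukey output.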
Non-Parametric route
4. Perform a Kruskal-Wallis test (a
non-parametric ANOVA) on the data by using kruskal.test() on the
'response ~ explanatory' formula. The null hypothesis being tested here
is that the mean RANK of each group is the same.
5. The p-value given with the
Kruskal-Wallis test indicates whether there are any significant
differences in the mean RANK of the groups. If this is small (smaller
than some arbitrarily pre-selected alpha, such as 0.05), then continue
to the post-hoc test in step 6.
6. Perform a Wilcoxon test on each pair
of groups and acquire a p-value for each. Compute an adjusted alpha
from your initial family-wide alpha and a multiple testing adjustment
such as the Bonferroni or Sidak correction. Alternatively, perform a
non-parametric post-hoc test such as Dunn's test.
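A minimal sketch of the non-parametric route in R, again using the built-in PlantGrowth data as an assumed stand-in for your own. It uses pairwise.wilcox.test(), which adjusts the p-values rather than alpha; with Bonferroni that leads to the same decisions as comparing raw p-values to an adjusted alpha.

```r
# Assumption: PlantGrowth stands in for your data.
dat <- PlantGrowth

# Steps 4 and 5: Kruskal-Wallis test on the same formula used for lm().
kw <- kruskal.test(weight ~ group, data = dat)
kw

# Step 6: Wilcoxon test on every pair of groups, Bonferroni-adjusted.
# exact = FALSE is passed on to wilcox.test() and requests the normal
# approximation, which avoids warnings when the data contain ties.
pw <- pairwise.wilcox.test(dat$weight, dat$group,
                           p.adjust.method = "bonferroni",
                           exact = FALSE)
pw

# Dunn's test is not in base R; it needs an add-on package.
```

Compare the adjusted p-values in the pairwise table directly to your family-wide alpha.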
My previously blogged reading assignments can be found here:
Designing Surveys:
Model Selection and Missing Data:
Causality and Significance Testing:
The keys to all of these, and to many other reading assignments, will be available starting May 2018 in my e-book "Writing for Statisticians" in the TopHat Marketplace at:
https://tophat.com/marketplace/
Finally, these posts tend to get more traction on social media if they have images, so here is a picture of my dog.