Wednesday, 6 December 2017

Reflection on teaching a 400-level course on Big Data for statisticians.

This was the first course I have taught that I would consider a 'main track' course, in that the students were to learn more about what they were already competent in at the start. Most of the courses I have previously taught were 'service courses', in that they were designed and delivered by the Statistics Department in service to other departments that wanted their own students to have a stronger quantitative and experimental background (e.g. Stat 201, 203, 302, and 305). The exception, Stat 342, was designed for statistics majors, but is built as an introduction to SAS programming. Since most other courses in the program are taught using R or Python, teaching SAS feels like teaching a service course as well, in that I am teaching something away from the students' main competency, and enrollment is mainly driven by requirement rather than interest.

In my usual courses, I am frequently grilled by anxious students about what exactly is going to be on exams. A frequent complaint in student responses is that I spend too much time on 'for interest' material that is not explicitly tested on the exams. I've also found that I needed to adhere to a rigid structure in course planning and in grading policy. Moving an assignment's due date, teaching something out of the stated syllabus order, changing the scope or schedule of a midterm, or even dropping an assignment and moving the grade weight have all caused a cascade of problems in previous classes.

Stat 440, Learning from Big Data, was a major shift.

I don't know which I prefer, and I don't know which is easier in the long term, but it is absolutely a different skill set. The bulk of the effort shifted from managing people to managing content. I did not struggle to keep the classroom full, but I did struggle to meaningfully fill the classroom's time. I had planned to cover the principles of modern methods (cross-validation, model selection, LASSO, dimension reduction, regression trees, neural nets), some data cleaning (missing data, imputation, image processing), some technology (SQL, parallelization, Hadoop), and some text analysis (regular expressions, edit distance, XML processing), but I still had a couple of weeks to fill at the end because of the lightning speed at which I was able to burn through these topics without protest.

In 'big data', the limits of my own knowledge became a major factor. Most of what I covered in class wouldn't have been considered undergrad material ten years ago when I was a senior (imputation, LASSO, neural nets); some of it didn't exist (Hadoop). There are plenty of textbook and online resources for learning regression or ANOVA, but the information for many of the topics of this course was cobbled together from blog posts, technical reports, and research papers. A lot of resources were either extremely narrow in scope or vague to the point of uselessness. I needed materials that were high-level enough for someone who isn't already a specialist to understand, yet technical enough that someone well-versed in data science would get something of value from them, and I didn't find enough.

The flip side of this was that motivation was easy. Two of the three case studies assigned had an active competition component. The first was a US-based challenge to use police data from three US cities, in which presentation was a major basis on which the police departments would judge the results. As such, I had requests for help with plotting and geographic methods that were completely new to me. A similar thing happened with the 'iceberg' case study, based on this Kaggle competition. I taught the basics of neural nets, and at least three groups asked me about generalizations and modifications to neural nets that I didn't know about. (The other case study was a 'warm-up' in which I adapted material from a case study competition held by the Statistical Society of Canada; the students were not in active competition.) At least 20% of the class has more statistical talent than I do.

To adapt to this challenge and advantage, I switched about mid-semester from my usual delivery method of a PDF slideshow to commentary while running through computer code. This worked well for material that would be directly useful for the three case-study projects, such as all the image processing work I showed for the case study on telling icebergs apart from ships. It wasn't as good for material that would be studied for the final exam. I went through some sample programs on web scraping, the feedback wasn't as positive, and the answers I got on the final exam's web scraping question were too specific to the examples I had given.

A side challenge was the ethical dilemma of limiting my advice to students looking to improve their projects. I had to avoid using insights that other students had shared with me because of the competitive nature of the class. Normally if someone had difficulty with a homework problem, I could use their learning and share it with others, but this time, that wasn't automatically the case.

There was also a substantial difference in size: I had 20-30 students, by far the smallest class I've ever lectured to. Previously, Stat 342 was my 'small' class, with enrollment between 50 and 80, compared to service classes of 100-300 students. This allowed me to communicate with students on a much more one-on-one level. Furthermore, since most of the work was done in small teams, I got to know what each group of students was working on for their projects.

I worry that what I delivered wasn't exactly big data, and was really more of a mixed bag of data science. However, there was a lot of feedback from the students that they found it valuable, and value-added was the goal all along.

Monday, 13 November 2017

Snakes and Ladders and Transition Matrices

Recently, /u/mikeeg555 created this post on the statistics subreddit with their results from a simulation of 10,000,000 games of this instance of Snakes and Ladders. This is the sort of thing that's good to show in an undergraduate or senior secondary classroom as a highlight of the insights you can get from the kind of simulation that anyone can program.

It's a good exercise, but it got a lot of attention because, for some reason, not a single simulated game ended in exactly 129 or exactly 251 turns. There were more than a thousand games that finished in 128 turns, and more than a thousand that finished in 130, but none in 129. The frequency distribution of the number of turns required resembled a smooth gamma distribution, aside from those two gaps. There were no unusually common counts to make up for the gaps. There were no further gaps at 258 (129 + 129), 387 (129 + 129 + 129), or 380 turns (129 + 251). Nothing except these two gaps was remarkable.

Why? How? Was there some cyclic, finite state, quirk to these turn counts? Was there an error in the original poster's code?
The code wasn't given until an update two hours later, but within that time, three others had independently written programs to confirm or refute the result:

/u/threenplusone, who wrote a simulator,
/u/hootback, who wrote a program to calculate transition matrices in Python,
and myself, who wrote a transition matrix program in R, shown below.

None of us found the gap. (The answer to why is at the bottom of this post.)

Transition matrices are a particularly nice way to 'solve' games like Snakes and Ladders because they give the probabilities of moving from one state (i.e. square) to another in a single turn. Another nice property is that you can make the transition matrix of a pair of consecutive transitions by multiplying the two matrices. We can describe a turn of snakes and ladders as two transitions:

1) A move of 1-6 spaces ahead, based on the roll of a fair 6-sided die, and, if relevant
2) A move from the bottom of a ladder to its end, or the tail of a snake to its head.

The first transition is a 100x100 matrix with six values of 1/6 in each row and 94 values of 0. The second transition, the interaction with the board, is mostly a diagonal matrix of 1's for all the squares without a snake or ladder. Where there IS a snake or ladder, the row for the ladder bottom or snake tail has a value of 1 in the column of the ladder top or snake head, and a zero on the diagonal entry. It's important to note that, at least in this version of the game, every ladder and snake leads to a square without any additional ladders or snakes.

To find the chance of getting from square i to square j in one turn, consult entry (i,j) of turn_matrix, which is the product of roll_matrix and SL_matrix (the snake-ladder matrix). To find the chance in k turns, raise turn_matrix to the k-th power and consult the same entry.

Nsq = 100 # a 100 square snakes and ladders board

die_probs = rep(1/6,6) # a fair 6-sided die
die_size = length(die_probs)

roll_mat = matrix(NA,nrow=Nsq,ncol=Nsq) # transition matrix of rolls
SL_mat = matrix(0,nrow=Nsq,ncol=Nsq) # transition matrix of snakes and ladders application

for(this_sq in 1:Nsq){
   ## Get the outcomes of a roll from square this_sq
   roll_vec = rep(0,Nsq)
   roll_vec[this_sq+(1:die_size)] = die_probs # apply the die probs

   ## Handle past-the-end rolls
   if( length(roll_vec) > Nsq){
      ## Rule 1: Past the end --> the end
      ##over_prob = sum(roll_vec[-(1:Nsq)])
      ##roll_vec[Nsq] = roll_vec[Nsq] + over_prob

      ## Rule 2: Exact roll needed for the end; over-rolls stay put
      over_prob = sum(roll_vec[-(1:Nsq)])
      roll_vec[this_sq] = roll_vec[this_sq] + over_prob

      ## Rule 3: 'bounce' the over-roll. (Example: an over-roll by 2 ends 2 squares from the end)
      #over_prob_vec = roll_vec[-(1:Nsq)]
      #Nover = length(over_prob_vec)
      #roll_vec[(Nsq-1):(Nsq-Nover)] = roll_vec[(Nsq-1):(Nsq-Nover)] + over_prob_vec

      ## ALL RULES: Truncate the roll vector back to the existing squares
      roll_vec = roll_vec[1:Nsq]
   }

   ## apply this vector to the matrix
   roll_mat[this_sq,] = roll_vec
}

SL_mat[4,14] = 1 ## There is a ladder from 4 to 14, which is always taken
SL_mat[9,31] = 1
SL_mat[20,38] = 1
SL_mat[28,84] = 1
SL_mat[40,59] = 1
SL_mat[51,67] = 1
SL_mat[63,81] = 1
SL_mat[71,91] = 1
SL_mat[17,7] = 1 # A snake from 17 to 7
SL_mat[54,34] = 1
SL_mat[62,19] = 1
SL_mat[64,60] = 1
SL_mat[87,24] = 1
SL_mat[93,73] = 1
SL_mat[95,75] = 1
SL_mat[99,78] = 1

## All other squares lead to themselves
diag(SL_mat) = 1 - apply(SL_mat,1,sum)
#SL_mat = t(SL_mat) ## Row and columns are to-from, not from-to

### A turn consists of a roll and a snake/ladder. Only one snake or ladder/turn
turn_mat = roll_mat %*% SL_mat

### Get the game completion chance after k turns
game_length = 500
win_by_prob = rep(NA,game_length)
win_by_prob[1] = 0
kturns_mat = turn_mat

### kturns_mat[i,j] will give you the probability of ending on square j, starting on square i, in exactly k turns.

for(k in 2:game_length){
   kturns_mat = kturns_mat %*% turn_mat
   win_by_prob[k] = kturns_mat[1,Nsq]
}

win_at_prob = c(0,diff(win_by_prob))

sum(win_at_prob) ## Total probability of finishing within game_length turns (near 1)
sum((1:game_length) * win_at_prob) ## (Approximate) mean turns to finish
which(win_at_prob == 0) ## Impossible numbers of turns to finish
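As a sanity check on the matrix-power idea, here is a self-contained toy example (a 3-state chain, not the Snakes and Ladders board) showing how repeated multiplication of a transition matrix gives k-turn probabilities:

```r
## Toy 3-state chain: from each state, stay with prob 0.5 or advance
## with prob 0.5; state 3 is absorbing (the 'end of the board').
P = matrix(c(0.5, 0.5, 0.0,
             0.0, 0.5, 0.5,
             0.0, 0.0, 1.0),
           nrow = 3, byrow = TRUE)

Pk = P
for (k in 2:10) Pk = Pk %*% P  # P raised to the 10th power

## Entry (1,3) is the chance of being absorbed within 10 turns,
## which for this chain is exactly 1 - 11 * (1/2)^10.
Pk[1, 3]
```

Every row of the k-step matrix still sums to 1, which is a quick way to catch indexing mistakes when building larger boards.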

Answer: The original poster's code contained a plotting error, not a mathematical one. The gaps were an artifact of the binning used for the histogram, which left the occasional empty bin.
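The artifact is easy to reproduce: if integer counts are dropped into histogram bins slightly narrower than 1, some bins inevitably catch no integer at all. A minimal base-R sketch (toy data, not the original poster's simulation):

```r
## Integer 'turn counts' where every value from 120 to 140 occurs.
x = rep(120:140, times = 50)

## Bins of width 0.97 -- slightly narrower than the spacing of the data.
breaks = seq(119.5, 141, by = 0.97)
counts = table(cut(x, breaks))

any(counts == 0)  # some bins are empty despite no gaps in the data
```

The empty bins appear wherever a whole bin falls between two consecutive integers, exactly the kind of spurious gap seen in the original plot.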

Cunningham's Law states that “The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer.” This post is a fine example, even if it was unintentional.

Wednesday, 8 November 2017

Writing a Resume as a Data Scientist or Statistician.

Consider the reader:

Your audience is typically NOT another data scientist. In a large company, it will be someone in a human resources department who has been told to look for certain key words and skills. They might read your resume for 20 seconds or less.

In a small company, your resume may get (a little) more time and may be read by someone closer to your specialty and (slightly) more familiar with the jargon of your little corner of your field. The same guiding principle governs both cases: make it as easy as possible for someone to evaluate and say 'yes'.

What is this reader going to want to know? “How can this person fill the missing hole in my organization RIGHT NOW?”

This means, even in highly qualified personnel jobs like those of data scientists, statisticians, programmers and researchers, that the potential for long term growth within a company is not a priority, at least not at the resume-reading stage. This is a major shift from the academic world, where timelines are typically much longer.

What does this mean to you, specifically:

- Opportunities come regularly, so don't panic if you don't get the 'right' position the first time it is posted.

- Future plans like the answer to the stereotypical interview question “where do you see yourself in 5 years” are now irrelevant to many employers. They shouldn't be mentioned on a resume either. Stick to what is solid: the past and present.

- Emphasize your skills as they are right now, not where they will be in 6 months. (e.g. 'I am currently studying...')

- Promising company loyalty (e.g. 'I have always wanted to work at...') in cover letters and other correspondence is a waste of time at best, and comes across as insincere at worst.

On the subject of transcript grades:

- After a certain minimum passing threshold, grades are not a good indicator of job performance.

- If you have graduated, or are on track to graduate soon, then you are already above this threshold, and no more about your grades needs to be said.

- The one exception would be that it is a good idea to mention the awards you have received related to grades. This includes the dean's list and scholarships. (e.g. “Graduated in 2017 with distinction”, “Made the Dean's List in 2015”)

Hobbies and activities from high school are irrelevant unless they are programming related or you are applying to a position in fast food. If you're reading a book titled “Writing for Statisticians”, it had better be the first case, or you are severely undervaluing your labour.

Rather than talk about the grades you earn or the courses you took, describe the projects you did in these courses as experiences. Be specific and clear without relying on jargon or writing too much, and keep in mind that the reader is unlikely to be familiar with the course numbers and titles from your institution.

Example 1:

Very bad: “Took Stat 485”

Bad: “Took a course in time-series”

Good: “Analyzed a time-series dataset of the economy of Kansas state.”

Better: “Investigated time-series econometric data, and wrote an executive report.”

Example 2:

Bad: “Took a course in big data.”

Good: “Scraped, cleaned, and applied a random-forest model to police call data in a Kaggle competition.”

Also good: “Developed a model to predict crime hotspots from a JSON database from the Seattle Police Department. Presented findings in a slide deck.”

In each of the 'good' examples, the experience is written in such a way as to demonstrate as many high-value skills as possible in a limited space.

The 'good' time-series example signals that
- You (the writer) can analyze real data.
- You are familiar with time-series data.
- You are familiar with econometric data.

The 'better' time-series example signals that
- You (the writer) can analyze real data.
- You are familiar with time-series data.
- You are familiar with econometric data.
As well as...
- You can communicate your findings to non-specialists.

The 'good' big data example communicates that
- You can analyze big (as in 'high volume') data.
- You can scrape data from the web, or at least from an internal database.
- You can prepare and clean data.
- You can format results for a common competition format (i.e. Kaggle).

The 'also good' big data example communicates that
- You can analyze big (as in 'high volume') data.
- You can build predictive, actionable models.
- You can work with JSON data.
- You can disseminate your findings to non-specialists, such as experts in fields other than your own.

Use 'business language' to subtly stretch the truth and frame things more favourably. For example, use the word 'setback' instead of 'failure', or use 'leverage' instead of 'use' or 'exploit'.

Use 'action verbs' to demonstrate your experience, especially as the first word of each statement of your experience. These are verbs that typically imply leadership, teamwork, or productivity skills. Such words include, but are not limited to:

(produced, created, developed, disseminated (i.e. spread), distributed, maintained, updated, cleaned, scraped, prepared, built, wrote, analyzed, coded, investigated)


In your experience and even your education, try to start as many sentences as possible with one of those action words. Remember to write about what you DID and not what you DO. In other words, use the past tense for everything including your current position.

One apparent exception is when describing duties instead of actions. A popular way to describe duties is to write “responsible for...”. This sounds like present tense; however, it's short for the past tense “I was responsible for...”, which brings us cleanly to the next point:

Taking advantage of assumptions and formatting:

You may have noticed some things are missing from the 'good' examples of experience. Specifically, articles and some prepositions are missing. The statements on a resume should be closer to news headlines than to complete sentences.

Everything on a resume is assumed to be about the person whose name is at the top of the resume. Statements that start with 'I was' are already longer than necessary; you and the things you have done are the topics of your resume, so it makes the most sense to include key information only, as long as it is not ambiguous. The other relevant 'what's and 'who's in a good resume are typically made clear from formatting.

Consider the following example:

“Constructed the database management system for the company”

, which can be shortened to

“Constructed database management system.”

while retaining all or nearly all of the meaning from a resume standpoint. In this example, the article “the” isn't necessary because the statement reads just as clearly without it. Likewise, “for the company” is redundant. Who else would you be doing this work for, if not the company? (If it was a personal skill-building exercise, you would still leave that information out; the point is that you have the skill. Why you acquired it is not important.)

The fewer words you use, while retaining the meaning, the less of those 20 seconds of reading time is wasted, ensuring that as great a share of that time as possible is spent observing that you have the qualifications requested in the posting.


As a footnote, this is my 100th published post, and it is also an excerpt from my upcoming interactive textbook, Writing for Statisticians, which should be available to the public on TopHat in Summer of 2018.

Thursday, 2 November 2017

Evaluating Exam Questions using Crowdmark and the Generalized Partial Credit Model

Making good exam questions is universally hard. The ideal question should have a clear solution for those with the requisite understanding, but be difficult enough that someone without that understanding cannot simply guess the answer. Additional complicating factors include accommodating correct but alternate or unexpected solutions, and barriers not directly related to the understanding being measured, such as language limitations. This is just one interpretation of 'good' for an exam question; there are other issues, like rote memorization vs. integrative understanding, that are harder still to measure or define.

On top of that, exam questions have a limited lifetime. To be able to reuse a question and maintain integrity between exams, information about the question cannot leave the exam room, which is technically impossible because people remember the exams they just did. Using new questions for each evaluation means bearing the risk of using untested questions, as well as bearing the workload of going through this difficult process every time.

In a later post, I will be releasing a large set of the exam questions I have made and used in the past, as well as the answer key and some annotations about each problem, including measures of difficulty and discrimination power that can be found using item response theory (IRT). This post is a vignette on using IRT to get these measures from exams that have been graded using the Crowdmark learning management software, which retains the score awarded to each student for each question.

Item response theory (IRT) can be used to find, after the fact, not just which of your exam questions were difficult, but also which ones were more or less effective at finding differences in ability between different students. There are many R packages that use IRT, as found here . For this analysis, I opted to use the latent trait model (ltm) package because it included an implementation of the generalized partial credit model (gpcm), which is particularly apt for the large, open-ended questions that I prefer for exams.

The exam data from Crowdmark has been scrubbed of identifying information. The order of the students has also been randomized and the original row order removed. However, for safety, it is highly recommended you check your data at this step manually to make sure no identifying information remains. Crowdmark exports the data of the students in alphabetical order, which is why reordering matters. If you plan to use multiple evaluations, such as the final exam and each midterm, make sure to combine the data sets with cbind() before reordering the rows.

## Drop the identifying columns by name (negative indexing does not work with column names)
drop_cols = c("Crowdmark.ID","Score.URL","Email","Canvas.ID","Name","Student.ID","Total")
dat = dat[, !(names(dat) %in% drop_cols)]
dat = dat[sample(1:nrow(dat)),] # randomize the row order
row.names(dat) = NULL # discard the original row labels

Standard IRT methods like the latent trait model (LTM) won't suffice because they only work with binary responses, and these questions award partial credit. Furthermore, a Rasch model won't be appropriate because it assumes that all the questions have the same discriminatory ability, and finding which questions are better at discriminating between students is a key research question. More details are found here.
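To make the role of discrimination concrete, here is a base-R sketch of the GPCM category probabilities (the same cumulative formula used for the expected-score calculations later in this post), using made-up coefficients rather than fitted values:

```r
## GPCM: probability of each score category for a given ability,
## built from cumulative discrimination-weighted distances.
gpcm_probs = function(coefs_cat, disc, ability) {
  top = exp(c(0, cumsum(disc * (ability - coefs_cat))))
  top / sum(top)
}

## Same category coefficients, different discrimination (toy values)
low  = gpcm_probs(c(-1, 0, 1), disc = 0.2, ability = 2)
high = gpcm_probs(c(-1, 0, 1), disc = 1.5, ability = 2)

## The high-discrimination item concentrates a strong student's
## probability mass much harder in the top category; a Rasch-type
## model would force this parameter to be equal across questions.
round(low, 3)
round(high, 3)
```

This is why letting discrimination vary by question matters: two questions with identical category coefficients can separate students very differently.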

First, an inspection of the model without further cleaning, looking specifically at the first question of a final exam.

The scores for question 1 range from 0 to 5, in half-point increments. Nobody scored a 4.5/5. Some of these counts are small to the point of uselessness.


0 0.5 1 1.5 2 2.5 3 3.5 4 5
6 1 12 5 32 14 32 12 22 3
mod = gpcm(dat)

value std.err z.value
Catgr.1 9.144 6.788 1.347
Catgr.2 -14.322 7.073 -2.025
Catgr.3 4.426 3.353 1.320
Catgr.4 -10.542 4.108 -2.566
Catgr.5 4.480 2.289 1.957
Catgr.6 -4.524 2.282 -1.982
Catgr.7 5.649 2.492 2.267
Catgr.8 -2.991 2.287 -1.308
Catgr.9 11.566 4.745 2.438

Dscrmn 0.180 0.056 3.232

If we reduce the number of categories to six by rounding the half points (note that R's round() rounds halves to the nearest even number), then only the endpoints have fewer than 10 observations.

dat = round(dat)


0 1 2 3 4 5
7 12 51 32 34 3

How does the model for question 1 differ after rounding the scores to the nearest whole?

mod = gpcm(dat)

value std.err z.value
Catgr.1 -2.25102527 1.4965783 -1.5041146
Catgr.2 -4.70417153 1.6356180 -2.8760821
Catgr.3 1.38695931 0.8272908 1.6765077
Catgr.4 0.07941232 0.7656370 0.1037206
Catgr.5 7.91144221 2.8570050 2.7691384

Dscrmn 0.32993218 0.1039574 3.1737254

The coefficients and their standard errors are smaller. Given that the small groups were removed, the smaller standard errors make sense.

That the coefficients are smaller is partly a reflection of the larger discrimination coefficient. To determine the log-odds of being in one category as opposed to an adjacent one, the discrimination coefficient is multiplied by the appropriate category coefficient. The discrimination coefficient also determines how much each unit of ability increases or decreases those same log-odds. By default, ability scores range from -4 for the weakest student in the cohort, to 0 for the average student, to +4 for the strongest student.

For example,

The fitted probability of an average student (ability = 0) getting 5/5 on the question, assuming that they got at least 4/5 would be
1 - (exp(7.911 * 0.3300) / (1 + exp(7.911 * 0.3300))) = 0.0685.

Compare this to the observed 3/37 = 0.0811 for the entire cohort.

Repeating this for getting a score of 4+ given that the student got 3+.
Fitted: 1 - (exp(0.079 * 0.3300) / (1 + exp(0.079 * 0.3300))) = 0.4935.
Compare this to the observed 37/69 = 0.5362.

and for getting a score of 3+ given that the student got 2+.
Fitted: 1 - (exp(1.387 * 0.3300) / (1 + exp(1.387 * 0.3300))) = 0.3875.
Compare this to the observed 69/120 = 0.5750.

and for getting a score of 2+ given that the student got 1+.
Fitted: 1 - (exp(-4.704 * 0.3300) / (1 + exp(-4.704 * 0.3300))) = 0.8252.
Compare this to the observed 120/132 = 0.9091.

and finally, for getting a score of 1+, unconditional
Fitted: 1 - (exp(-2.251 * 0.3300) / (1 + exp(-2.251 * 0.3300))) = 0.6776.
Compare this to the observed 132/139 = 0.9496.

These probabilities can be quickly found with...

library(faraway) # for the inverse logit function ilogit()
coefs_cat = coef(mod)$Q1[1:5]
coefs_disc = coef(mod)$Q1[6]
ability = 0
prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

For someone with ability score +4 instead of zero, subtract 4 from each category coefficient. So for a top student, the coefficients above would be 3.911, -3.921, -2.613, -8.704, and -6.251 respectively. Their counterpart probabilities would be 0.2157, 0.7848, 0.7031, 0.9465, and 0.8872 respectively.

ability = 4
prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

We can use these conditional probabilities to get marginal probabilities of score 1+, 2+, ..., and therefore an expected score for an average student, or any student with ability in the -4 to 4 range.

Ncat = length(coef(mod)$Q1) - 1
coefs_cat = coef(mod)$Q1[1:Ncat]
coefs_disc = coef(mod)$Q1[Ncat + 1]
ability = 0
top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
bottom = sum(top)
prob = top / bottom

E_score = sum( 0:Ncat * prob)

These calculations yield an expected score of 2.61/5 for an average student, 3.83/5 for the best student, and 1.04 for the worst student.

Compare the observed mean of 2.59 and range of 0 to 5 for n=139 students.

For convenience and scalability, we can put the expected score calculations in a function.

get.exp.score = function(coefs, ability){
   Ncat = length(coefs) - 1
   coefs_cat = coefs[1:Ncat]
   coefs_disc = coefs[Ncat + 1]
   top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
   bottom = sum(top)
   prob = top / bottom

   E_score = sum( 0:Ncat * prob)
   return(E_score)
}

get.discrim = function(coefs){
   Ncat = length(coefs) - 1
   coefs_disc = coefs[Ncat + 1]
   return(coefs_disc)
}


Now we repeat this process across every question. Note that the coef() function for the gpcm model is a list of vectors, not a matrix. Therefore to reference the kth element, we need to use [[k]].

mod = gpcm(dat)

Nquest = length(coef(mod))
ability_level = c(-4,-2,0,2,4)
Nability = length(ability_level)
E_scores_mat = matrix(NA,nrow=Nquest,ncol=Nability)
discrim = rep(NA,Nquest)
scoremax = apply(dat,2,max)

for(k in 1:Nquest){
   for(j in 1:Nability){
      E_scores_mat[k,j] = get.exp.score(coef(mod)[[k]], ability_level[j])
   }
   discrim[k] = get.discrim(coef(mod)[[k]])

   ## Normalizing by the maximum possible score on each question
   E_scores_mat[k,] = E_scores_mat[k,] / scoremax[k]
}
E_scores_mat = round(E_scores_mat,4)

q_name = paste0("FinalQ",1:Nquest)
q_info = data.frame(q_name,E_scores_mat,scoremax,discrim)
names(q_info) = c("Name",paste0("Escore_at",ability_level),"max","discrim")
Name Escore_at-4 Escore_at-2 Escore_at0 Escore_at2 Escore_at4 max discrim
Q1 FinalQ1 0.2036 0.3572 0.5218 0.6666 0.7697 5 0.330
Q2 FinalQ2 0.0795 0.2585 0.7546 0.9789 0.9973 6 0.819
Q3 FinalQ3 0.0657 0.2916 0.6110 0.8681 0.9626 7 0.742
Q4 FinalQ4 0.0198 0.2402 0.7568 0.9281 0.9791 9 0.508
Q5 FinalQ5 0.1748 0.4496 0.5614 0.5870 0.5949 5 0.446
Q6 FinalQ6 0.1072 0.5685 0.9516 0.9927 0.9982 4 0.521
Q7 FinalQ7 0.3042 0.5379 0.7413 0.8465 0.9018 7 0.201
Q8 FinalQ8 0.1206 0.3451 0.6526 0.8945 0.9744 7 0.628
Q9 FinalQ9 0.1686 0.3031 0.4875 0.5827 0.6114 8 0.232
Q10 FinalQ10 0.1079 0.4182 0.7577 0.8466 0.8663 8 0.408
Q11 FinalQ11 0.2066 0.4157 0.5480 0.6187 0.6653 8 0.275

Item Response Theory package options

Documentation on the gpcm function in the ltm package

Original paper on the Generalized Partial Credit Model, by Eiji Muraki (1992)

Tuesday, 3 October 2017

Book review of Improving How Universities Teach Science Part 2: Criticisms and comparison to the ISTLD

As far as pedagogical literature goes, Carl Wieman’s Improving How Universities Teach Science - Lessons from the Science Education Initiative was among my favourites. It keeps both jargon and length to a minimum, as it is barely more than 200 pages, not counting the hiring guide at the end. The work and its presentation are strongly evidence-based and informative, and the transformation guide in the appendices provides possible actions that the reader can take to improve their own classes. Most of it is applicable across science education.

 I have some criticisms, but please don’t take this as a condemnation of the book or of the SEI in general.

The term ‘transformation’ is vague, despite being used extensively throughout the first two thirds of the book. This is partially forgivable because it has to be vague in order to cover what could be considered a transformation across many different science fields. However, there could have been more examples, more elaboration, or a better definition of the term early on. The first concrete examples that clarify what transformation means are found in the appendix, 170 pages in.

Dismissal of the UBC mathematics Department, and of mathematics education in general.

The metric Wieman primarily used was the proportion of faculty that bought in to the program; that is, the proportion of faculty that transformed their courses, because typically faculty transformed all of their courses or none of them. Many departments were considered a success in that 70% or more of their faculty transformed their classes. Those under 50% were mostly special cases that had entered the Science Education Initiative later and hadn't had the opportunity to transform. Among the all-stars was the UBC statistics department: 88% of their 17 faculty with teaching appointments transformed their classes. Among the faculty of the UBC mathematics department, however, only 10% of their 150+ strong department bought in and transformed their classes. By contrast, $1.2 million was spent on the mathematics department while $300,000 was spent on the statistics department, so the mathematics people got more in total but the statistics people got more per faculty. It's not the failure to transform the mathematics department that bothers me, but the explanation for it.
Wieman boils down the failure to transform the mathematics department to two factors. First was the culture within that particular department, which did not emphasize undergraduate education and seemed to assume that mathematics was an innate ability that students either had or did not, regardless of the amount of effort put in. Before Wieman started attempting to transform this department, it had a policy of automatically failing the bottom few percentiles of every introductory calculus class regardless of performance. The second factor Wieman uses to explain the failure is that mathematics is inherently not empirical, which means that a lot of the active learning meant to make concepts more concrete would not have applied.

Having taught and been taught in both mathematics and statistics departments at multiple institutions myself, I don't buy these arguments. In my experience, the most engaging and active classrooms have been spread equally across mathematics and statistics. Within mathematics, the most memorable was abstract algebra, which by definition is non-empirical. Furthermore, at Simon Fraser University it's the mathematics department that has been leading the way on course transformation.
As for the argument about innate ability, this is an idea that spreads far beyond university departments, and I have no qualification to claim how true or false it is. It is, however, not a useful assumption, because it renders many things related to teaching quality in mathematics automatically non-actionable.
Finally, it seems like a strange argument for a professor of physics to make about mathematics. I would have liked to see more investigation; perhaps it's covered in some of Wieman's other literature, but then I would have liked to see more references to that literature, if it exists.

Compared to the Institute for the Study of Teaching and Learning in the Disciplines (ISTLD) at SFU, Wieman’s SEI is several times larger in scale and tackles the problem of university teaching entire departments at a time. The SEI works with department chairs directly and with faculty actively through its science education specialists. The ISTLD’s projects are self-contained course improvements, where staff and graduate student research assistants provide literature searches, initial guidance, and loose oversight over the course improvement projects. Both initiatives fostered large volumes of published and publicly presented research.
The funding for course improvement projects through the ISTLD was non-competitive; the only requirements to receive a grant were to submit a draft proposal, attend some workshops on pedagogy, and submit a new proposal guided by those workshops. Grants from the SEI at both UBC and CU were awarded through a competitive process, which Wieman used because, in his words, it was the only system familiar to science faculty.

In case you missed it, here is the first part of this book review, which discusses the content more directly.

Book review of Improving How Universities Teach Science, Part 1: Content.

Unlike much of the other literature on the subject, Carl Wieman’s Improving How Universities Teach Science - Lessons from the Science Education Initiative spends most of its pages on the administrative issues involved in improving university teaching. If you're familiar with recent pedagogical literature, this book doesn't come with many surprises. What set it apart for me is the scale of the work that Wieman undertook, and his emphasis on educational improvement being an integrative process across an entire department rather than a set of independent advances.

The Science Education Initiative, or SEI, model is about changing entire departments through large multi-year, multi-million dollar projects. The initiative focuses on transforming classes by getting faculty to buy into the idea of transforming them, rather than by transforming the classes directly.

The content is based on Wieman’s experience developing a science education initiative at both the University of British Columbia (UBC) and the University of Colorado (CU). It starts with a vision of what an ideal education system would look like at a university, mostly as an inspiring goal rather than a practical milestone. It continues with a description of how change was enacted at both universities. The primary workforce behind these changes was a new staff position called the science education specialist, or SES. SES positions typically went to recent science PhD graduates that had a particular interest in education. These specialists were hired, trained in modern pedagogy and techniques to foster active learning, and then assigned as consultants or partners to faculty that had requested help in course transformation.

The faculty themselves were induced to help through formal incentives, like money for research or teaching buy-outs that allowed them more time for research, and through informal incentives, like consideration in teaching assignments and opportunities for co-authorship on scholarly research. Overcoming the already established incentive systems (e.g. publish or perish) that prioritized research over teaching is a common motif throughout the book.

The middle third of the book is reflective, and it’s also the meatiest part; if you’re short on time, read only Chapters 5 and 6 and the coda. Here, Wieman talks about which parts of the initiative worked immediately, which worked after changes, and which never worked and why. He describes his shift from a focus on changing courses to a focus on changing the attitudes of faculty. He talks about the differences in support he had at the two universities and how that affected the success of his program; for example, UBC provided twice the financial support as well as direct leadership support from the dean. He also compares the success rates of different departments within the science faculty. Of particular interest to me are the UBC statistics and mathematics departments, which obtained radically different results: the statistics department almost unanimously transformed their courses, while the mathematics department almost unanimously didn’t.

Wieman also talks at length about ‘ownership’ of courses, and how faculty feeling that they own certain courses is a roadblock. It is a roadblock partly because of the habit of faculty keeping their lecture notes to themselves, on the assumption that they are the only one teaching a particular course, and partly because the culture of ownership was perceived to contribute to faculty resistance to changes to their courses.

Under Wieman's model, course material is to be shared with the whole department, so that anyone teaching a particular course has access to all the relevant material the department has made for it. Although UBC managed to create a repository for course material, the onus of populating that repository was on the faculty, and few people actually contributed. However, where sharing matters most, in the introductory courses, even partial sharing was enough, because many people tend to teach those courses.

The final third of the book is a set of appendices, which include examples of learning activities and strategies in transformed courses, guiding principles for instruction, and several short essays on educational habits with references to much of Wieman’s other work. It also includes a hiring guide with sample interview questions for prospective science education specialists.

The book also includes a coda, an 8-page executive summary of the first two parts of the book. The coda serves as a good review and as a nicely packaged chapter that could be shared with decision makers such as deans and faculty chairs. Decision makers are exactly who I would recommend this book to; it delivers an excellent amount of information for the time and effort it takes to digest.

I had a few other thoughts about this book that were set aside for the sake of flow. You can find them in the second part of this book review.