Thursday, 2 November 2017

Evaluating Exam Questions using Crowdmark and the Generalized Partial Credit Model

Making good exam questions is universally hard. The ideal question should have a clear solution for those with the requisite understanding, yet be difficult enough that someone without the needed knowledge cannot guess their way to an answer. Additional complicating factors include accommodating correct but alternate or unexpected solutions, and barriers not directly related to the understanding being measured, such as language limitations. This is just one interpretation of 'good' for an exam question; other issues, like rote memorization vs. integrative understanding, are harder still to measure or define.

On top of that, exam questions have a limited lifetime. To reuse a question and maintain integrity between exams, information about the question cannot leave the exam room, which is impossible in practice because people remember the exams they have just written. Using new questions for each evaluation means bearing the risk of using untested questions, as well as the workload of going through this difficult process every time.

In a later post, I will be releasing a large set of the exam questions I have made and used in the past, as well as the answer key and some annotations about each problem, including measures of difficulty and discrimination power that can be found using item response theory (IRT). This post is a vignette on using IRT to get these measures from exams that have been graded using the Crowdmark learning management software, which retains the score awarded to each student for each question.


Item response theory (IRT) can be used to find, after the fact, not just which of your exam questions were difficult, but also which ones were more or less effective at finding differences in ability between different students. There are many R packages that use IRT, as found here https://cran.r-project.org/web/views/Psychometrics.html . For this analysis, I opted to use the latent trait model (ltm) package because it included an implementation of the generalized partial credit model (gpcm), which is particularly apt for the large, open-ended questions that I prefer for exams.

The exam data from Crowdmark has been scrubbed of identifying information. The order of the students has also been randomized and the original row order removed. However, for safety, it is highly recommended that you check your data manually at this step to make sure no identifying information remains. Crowdmark exports the students' data in alphabetical order, which is why reordering matters. If you plan to use multiple evaluations, such as the final exam and each midterm, make sure to combine the data sets with cbind() before reordering the rows.

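# Drop identifying and summary columns by name (negative indexing only works with numeric positions)
id_cols = c("Crowdmark.ID","Score.URL","Email","Canvas.ID","Name","Student.ID","Total")
dat = dat[, !(names(dat) %in% id_cols)]
# Shuffle the rows, then reset the row names so they no longer record the original order
dat = dat[sample(1:nrow(dat)),]
row.names(dat) = NULL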
head(dat)
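
As a rough sketch of the "combine first, shuffle second" advice, the following uses two tiny made-up exports. The data frames midterm and final, their values, and the Mid/Final column prefixes are all hypothetical stand-ins, not Crowdmark output. The point is only that cbind() pairs rows by position, so the student order should be verified before the identifying column is dropped.

```r
# Toy stand-ins for two exports; every name and value here is made up for illustration.
midterm = data.frame(Student.ID = c(11, 12, 13), Q1 = c(3, 4.5, 2))
final   = data.frame(Student.ID = c(11, 12, 13), Q1 = c(5, 1, 4))

# cbind() pairs rows by position, so confirm both exports list the same students in the same order.
stopifnot(identical(midterm$Student.ID, final$Student.ID))

# Keep only the score columns and label them by evaluation.
scores_mid = midterm[, "Q1", drop = FALSE]
scores_fin = final[, "Q1", drop = FALSE]
names(scores_mid) = paste0("Mid", names(scores_mid))
names(scores_fin) = paste0("Final", names(scores_fin))
combined = cbind(scores_mid, scores_fin)

# Shuffle once, after combining, so each row still belongs to a single student.
set.seed(1)
combined = combined[sample(nrow(combined)), , drop = FALSE]
row.names(combined) = NULL
combined
```

If the exports might list students in different orders, merging on the ID column before dropping it would be a safer route than cbind().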


A standard IRT method like the latent trait model (LTM) won't suffice because it only works for binary responses, and these questions have partial credit. Furthermore, a Rasch model won't be appropriate because it assumes that all the questions have the same discriminatory ability, and finding which questions are better at discriminating between students is a key research question. More details are found here https://www.rdocumentation.org/packages/ltm/versions/1.1-0/topics/gpcm
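
For reference, one common way to write the generalized partial credit model is sketched below. The symbols are notation introduced here rather than taken from the ltm documentation: theta is a student's ability, a_i is the discrimination of question i (reported as Dscrmn), b_ic is the c-th category coefficient of question i (reported as Catgr.c), and m_i is the maximum score on the question. An empty sum (r = 0 or k = 0) is taken to be zero.

```latex
P(X_i = k \mid \theta) =
  \frac{\exp\left( \sum_{c=1}^{k} a_i (\theta - b_{ic}) \right)}
       {\sum_{r=0}^{m_i} \exp\left( \sum_{c=1}^{r} a_i (\theta - b_{ic}) \right)},
  \qquad k = 0, 1, \ldots, m_i

\log \frac{P(X_i = k \mid \theta)}{P(X_i = k - 1 \mid \theta)} = a_i (\theta - b_{ik})
```

The second line is the adjacent-category log-odds. A Rasch-type partial credit model corresponds to forcing every a_i to the same value, which is exactly the restriction this analysis wants to avoid.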


First, an inspection of the model without further cleaning, looking specifically at the first question of a final exam.


The scores for question 1 range from 0 to 5 in half-point increments. Nobody scored 4.5/5. Some of these counts are small to the point of uselessness.

table(dat$Q1)

  0 0.5   1 1.5   2 2.5   3 3.5   4   5
  6   1  12   5  32  14  32  12  22   3

library(ltm)
mod = gpcm(dat)
summary(mod)



Coefficients:
$Q1
          value std.err z.value
Catgr.1   9.144   6.788   1.347
Catgr.2 -14.322   7.073  -2.025
Catgr.3   4.426   3.353   1.320
Catgr.4 -10.542   4.108  -2.566
Catgr.5   4.480   2.289   1.957
Catgr.6  -4.524   2.282  -1.982
Catgr.7   5.649   2.492   2.267
Catgr.8  -2.991   2.287  -1.308
Catgr.9  11.566   4.745   2.438

Dscrmn    0.180   0.056   3.232


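If we reduce the number of categories to six by rounding each half-point score to a whole number, then only the endpoints have fewer than 10 observations. Note that R's round() sends exact halves to the nearest even integer, so 0.5 becomes 0 and 2.5 becomes 2, while 1.5 becomes 2 and 3.5 becomes 4.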

dat = round(dat)

table(dat$Q1)

 0  1  2  3  4  5
 7 12 51 32 34  3
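
The pooled counts can be reproduced from the half-point table without the original data. This is a small sketch using only the counts printed earlier; the vector names scores, counts, pooled and half_up are introduced here for illustration.

```r
# Half-point scores for question 1 and their counts, as tabulated before rounding
scores = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5)
counts = c(6, 1, 12, 5, 32, 14, 32, 12, 22, 3)

# round() sends exact halves to the nearest even integer
rounded = round(scores)
rounded

# Pool the counts that land on the same whole-number score
pooled = tapply(counts, rounded, sum)
pooled

# For comparison: always rounding halves upward gives a different grouping
half_up = floor(scores + 0.5)
tapply(counts, half_up, sum)
```

Which rule to use is a judgment call. The half-to-even behaviour happens to leave no interior category with fewer than 10 students here, but that is a property of this data set rather than of the rule.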

How does the model for question 1 differ after rounding the scores to the nearest whole?

mod = gpcm(dat)
summary(mod)$coefficients$Q1

              value   std.err    z.value
Catgr.1 -2.25102527 1.4965783 -1.5041146
Catgr.2 -4.70417153 1.6356180 -2.8760821
Catgr.3  1.38695931 0.8272908  1.6765077
Catgr.4  0.07941232 0.7656370  0.1037206
Catgr.5  7.91144221 2.8570050  2.7691384

Dscrmn   0.32993218 0.1039574  3.1737254

The coefficients and their standard errors are smaller. Given that the small groups were merged into their neighbours, the smaller standard errors make sense.

That the coefficients are smaller is partly a reflection of the larger discrimination coefficient. To determine the log-odds of being in one category as opposed to an adjacent one, the discrimination coefficient is multiplied by the appropriate category coefficient. The discrimination coefficient also determines how much each unit of ability increases or decreases those same log-odds. By default, ability scores range from -4 for the weakest student in the cohort, to 0 for the average student, to +4 for the strongest student.

For example,

The fitted probability of an average student (ability = 0) getting 5/5 on the question, assuming that they got at least 4/5 would be
1 - (exp(7.911 * 0.3300) / (1 + exp(7.911 * 0.3300))) = 0.0685.

Compare this to the observed 3/37 = 0.0811 for the entire cohort.

Repeating this for getting a score of 4+ given that the student got 3+.
Fitted: 1 - (exp(0.079 * 0.3300) / (1 + exp(0.079 * 0.3300))) = 0.4935.
Compare this to the observed 37/69 = 0.5362.

and for getting a score of 3+ given that the student got 2+.
Fitted: 1 - (exp(1.387 * 0.3300) / (1 + exp(1.387 * 0.3300))) = 0.3875.
Compare this to the observed 69/120 = 0.5750.

and for getting a score of 2+ given that the student got 1+.
Fitted: 1 - (exp(-4.704 * 0.3300) / (1 + exp(-4.704 * 0.3300))) = 0.8252.
Compare this to the observed 120/132 = 0.9091.

and finally, for getting a score of 1+, unconditional
Fitted: 1 - (exp(-2.251 * 0.3300) / (1 + exp(-2.251 * 0.3300))) = 0.6776.
Compare this to the observed 132/139 = 0.9496.

These probabilities can be quickly found with...

library(faraway) # for the inverse logit function
coefs_cat = coef(mod)$Q1[1:5]   # category coefficients Catgr.1 to Catgr.5
coefs_disc = coef(mod)$Q1[6]    # discrimination coefficient
ability = 0
prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)
prob_cond
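
The observed proportions used for comparison can be recovered from the rounded score counts alone. This is a small sketch; the names counts, at_least and obs_cond are introduced here and are not part of the original analysis.

```r
# Rounded question 1 counts for scores 0 through 5
counts = c("0" = 7, "1" = 12, "2" = 51, "3" = 32, "4" = 34, "5" = 3)

# Number of students scoring k or more, for k = 0, ..., 5
at_least = rev(cumsum(rev(counts)))
at_least

# Observed proportion scoring k or more among those who scored at least k - 1
obs_cond = at_least[-1] / at_least[-length(at_least)]
round(obs_cond, 4)
```

The entries are ordered from "1 or more" up to "5 given 4 or more", the same order in which prob_cond lists the fitted values.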


For someone with ability score +4 instead of zero, subtract 4 from each category coefficient. So for a top student, the coefficients above would be 3.911, -3.921, -2.613, -8.704, and -6.251 respectively. Their counterpart probabilities would be 0.2157, 0.7848, 0.7031, 0.9465, and 0.8872 respectively.

ability = 4
prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)
prob_cond

We can use the same coefficients to get the probability of each score 0, 1, ..., 5, and therefore an expected score for an average student, or any student with ability in the -4 to 4 range.

Ncat = length(coef(mod)$Q1) - 1
coefs_cat = coef(mod)$Q1[1:Ncat]
coefs_disc = coef(mod)$Q1[Ncat + 1]
ability = 0
top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
bottom = sum(top)
prob = top / bottom
prob

E_score = sum( 0:Ncat * prob)
E_score

These calculations yield an expected score of 2.61/5 for an average student, 3.85/5 for the best student, and 1.02/5 for the worst student.

Compare the observed mean of 2.59 and range of 0 to 5 for n=139 students.
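
As a cross-check that does not need the fitted model object, the expected scores can be recomputed from the question 1 estimates printed earlier. This is only a sketch: the coefficients are typed in by hand from the rounded-score summary, and the names abilities and E_scores are introduced here.

```r
# Question 1 estimates after rounding, copied from the summary output
coefs_cat  = c(-2.25102527, -4.70417153, 1.38695931, 0.07941232, 7.91144221)
coefs_disc = 0.32993218

abilities = c(-4, 0, 4)              # weakest, average, strongest
E_scores  = numeric(length(abilities))

for (i in seq_along(abilities)) {
  # Unnormalized weight of each score 0..5, then normalize to probabilities
  top  = exp(c(0, cumsum(coefs_disc * (abilities[i] - coefs_cat))))
  prob = top / sum(top)
  # Expected score is the probability-weighted average of the possible scores
  E_scores[i] = sum(0:5 * prob)
}

round(E_scores, 2)
```

Dividing these values by the maximum score of 5 puts them on the 0 to 1 scale, which makes questions with different maximum scores comparable.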

For convenience and scalability, we can put the expected score calculations in a function.


get.exp.score = function(coefs, ability)
{
   Ncat = length(coefs) - 1
   coefs_cat = coefs[1:Ncat]
   coefs_disc = coefs[Ncat + 1]
   top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
   bottom = sum(top)
   prob = top / bottom

   E_score = sum( 0:Ncat * prob)
   return(E_score)
}



get.discrim = function(coefs)
{
   Ncat = length(coefs) - 1
   coefs_disc = coefs[Ncat + 1]

   return(coefs_disc)
}


Now we repeat this process across every question. Note that the coef() function for the gpcm model returns a list of vectors, not a matrix. Therefore, to reference the kth element, we need to use [[k]].
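
The difference between [k] and [[k]] is easy to see on a small hand-built list. The list below is a hypothetical stand-in for the coef() output with made-up values; it only mimics the shape described above, one named vector per question with possibly different lengths.

```r
# Hypothetical stand-in for coef(mod): one named vector per question (values are made up)
coefs = list(
  Q1 = c(Catgr.1 = -2.25, Catgr.2 = -4.70, Dscrmn = 0.33),
  Q2 = c(Catgr.1 = 0.50, Dscrmn = 0.82)
)

class(coefs[1])     # single brackets return a one-element list
class(coefs[[1]])   # double brackets return the numeric vector itself

# The discrimination coefficient is the last element of each vector
last_elements = sapply(coefs, function(v) v[length(v)])
last_elements
```

Because the vectors can have different lengths when questions have different numbers of score categories, a list is the natural container, and functions such as get.exp.score need the vector itself rather than a one-element list.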

library(ltm)
mod = gpcm(dat)

Nquest = length(coef(mod))
ability_level = c(-4,-2,0,2,4)
Nability = length(ability_level)
E_scores_mat = matrix(NA,nrow=Nquest,ncol=Nability)
discrim = rep(NA,Nquest)
scoremax = apply(dat,2,max)

for(k in 1:Nquest)
{
   coefs_k = coef(mod)[[k]]
   discrim[k] = get.discrim(coefs_k)

   for(j in 1:Nability)
   {
      E_scores_mat[k,j] = get.exp.score(coefs_k, ability_level[j])
   }

   ## Normalizing: express each expected score as a fraction of the question's maximum
   E_scores_mat[k,] = E_scores_mat[k,] / scoremax[k]
}
E_scores_mat = round(E_scores_mat,4)


q_name = paste0("FinalQ",1:Nquest)
q_info = data.frame(q_name,E_scores_mat,scoremax,discrim)
names(q_info) = c("Name",paste0("Escore_at",ability_level),"max","discrim")
 
q_info
        Name Escore_at-4 Escore_at-2 Escore_at0 Escore_at2 Escore_at4 max discrim
Q1   FinalQ1      0.2036      0.3572     0.5218     0.6666     0.7697   5   0.330
Q2   FinalQ2      0.0795      0.2585     0.7546     0.9789     0.9973   6   0.819
Q3   FinalQ3      0.0657      0.2916     0.6110     0.8681     0.9626   7   0.742
Q4   FinalQ4      0.0198      0.2402     0.7568     0.9281     0.9791   9   0.508
Q5   FinalQ5      0.1748      0.4496     0.5614     0.5870     0.5949   5   0.446
Q6   FinalQ6      0.1072      0.5685     0.9516     0.9927     0.9982   4   0.521
Q7   FinalQ7      0.3042      0.5379     0.7413     0.8465     0.9018   7   0.201
Q8   FinalQ8      0.1206      0.3451     0.6526     0.8945     0.9744   7   0.628
Q9   FinalQ9      0.1686      0.3031     0.4875     0.5827     0.6114   8   0.232
Q10 FinalQ10      0.1079      0.4182     0.7577     0.8466     0.8663   8   0.408
Q11 FinalQ11      0.2066      0.4157     0.5480     0.6187     0.6653   8   0.275
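
One way to read this table is to rank the questions, either by the discrimination coefficient or by how far the normalized expected score moves between the weakest and strongest ability levels. The sketch below retypes three columns of the table by hand; the names low, high, ranked and spread are introduced here for illustration.

```r
# Columns copied from the q_info table above
q_name  = paste0("FinalQ", 1:11)
discrim = c(0.330, 0.819, 0.742, 0.508, 0.446, 0.521, 0.201, 0.628, 0.232, 0.408, 0.275)
low     = c(0.2036, 0.0795, 0.0657, 0.0198, 0.1748, 0.1072, 0.3042, 0.1206, 0.1686, 0.1079, 0.2066)
high    = c(0.7697, 0.9973, 0.9626, 0.9791, 0.5949, 0.9982, 0.9018, 0.9744, 0.6114, 0.8663, 0.6653)
names(discrim) = q_name

# Questions ordered from most to least discriminating
ranked = sort(discrim, decreasing = TRUE)
ranked

# Gain in normalized expected score between ability -4 and ability +4
spread = high - low
names(spread) = q_name
round(sort(spread, decreasing = TRUE), 4)
```

The two rankings need not agree, since the spread also depends on where the category coefficients sit relative to the ability range.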

References:
Item Response Theory package options https://cran.r-project.org/web/views/Psychometrics.html

Documentation on the gpcm function in the ltm package https://www.rdocumentation.org/packages/ltm/versions/1.1-0/topics/gpcm

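Original paper on the Generalized Partial Credit Model: Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.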