Making good exam questions is universally hard. The ideal question should have a clear solution to those with the requisite understanding, but also be difficult enough that someone without the needed knowledge cannot guess their way to an answer.

An item response theory (IRT) based analysis can estimate the difficulty of a question, as well as the general skill of each of the test takers. The generalized partial credit model extends classical IRT from questions with binary scores to ones with an ordinal set of possible scores.

R code and example inside.

Additional complicating factors include accommodations for correct but alternate or unexpected solutions, and barriers not directly related to the understanding being measured, such as language limitations. This is just one interpretation of 'good' for an exam question, because there are other issues like rote memorization vs. integrative understanding that are harder still to measure or define.

On top of that, exam questions have a limited lifetime. To be able to reuse a question and maintain integrity between exams, information about the question cannot leave the exam room, which is technically impossible because people remember the exams they just did. Using new questions for each evaluation means bearing the risk of using untested questions, as well as bearing the workload of going through this difficult process every time.

In a later post, I will be
releasing a large set of the exam questions I have made and used in
the past, as well as the answer key and some annotations about each
problem, including measures of difficulty and discrimination power
that can be found using item response theory (IRT). This post is a
vignette on using IRT to get these measures from exams that have been
graded using the Crowdmark learning management software, which
retains the score awarded to each student for each question.

Item response theory (IRT) can be used to find, after the fact, not just which of your exam questions were difficult, but also which ones were more or less effective at finding differences in ability between students. There are many R packages that use IRT, as listed here:
https://cran.r-project.org/web/views/Psychometrics.html
For this analysis, I opted to use the latent trait model (ltm) package because it includes an implementation of the generalized partial credit model (gpcm), which is particularly apt for the large, open-ended questions that I prefer for exams.

The exam data from Crowdmark has been scrubbed of identifying information. The order of the students has also been randomized and the original row order removed. However, for safety, it is highly recommended that you check your data manually after this step to make sure no identifying information remains. Crowdmark exports the data of the students in alphabetical order, which is why reordering matters. If you plan to use multiple evaluations, such as the final exam and each midterm, make sure to combine the data sets with cbind() before reordering the rows.

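# Drop the identifying and non-question columns exported by Crowdmark
id_cols = c("Crowdmark.ID","Score.URL","Email","Canvas.ID","Name","Student.ID","Total")
dat = dat[, !(names(dat) %in% id_cols)]

# Shuffle the rows, then discard the original row names
dat = dat[sample(1:nrow(dat)),]
row.names(dat) = NULL

head(dat)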
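To illustrate why the evaluations should be combined before shuffling, here is a minimal sketch with made-up data. The data frames `midterm` and `final` and their column names are hypothetical stand-ins, not the real Crowdmark export. Shuffling after cbind() moves whole rows, so each student's midterm and final scores stay together.

```r
# Hypothetical toy data: three students, rows in the same (alphabetical) order
midterm = data.frame(MidQ1 = c(1, 2, 3), MidQ2 = c(4, 5, 6))
final   = data.frame(FinalQ1 = c(10, 20, 30))

# Combine first, so each row still belongs to a single student...
combined = cbind(midterm, final)

# ...then shuffle whole rows and drop the old row names
set.seed(1)
shuffled = combined[sample(1:nrow(combined)), ]
row.names(shuffled) = NULL

# In this toy data FinalQ1 is always 10 * MidQ1, so this checks rows stayed intact
all(shuffled$FinalQ1 == 10 * shuffled$MidQ1)
```

Shuffling each data set separately and combining afterwards would pair one student's midterm with another student's final.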

Standard IRT methods such as the latent trait model (LTM) won't suffice because they only work for binary responses, and these questions have partial credit. Furthermore, a Rasch model won't be appropriate because it assumes that all the questions have the same discriminatory ability. Finding which questions are better at discriminating between students is a key research question. More details are found here:
https://www.rdocumentation.org/packages/ltm/versions/1.1-0/topics/gpcm
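For reference, the category probabilities of the generalized partial credit model can be written as below. This is the usual textbook form rather than something quoted from the package documentation, and the symbols are notation chosen here: θ is a student's ability, a_i is the discrimination of question i, b_{iv} are its category coefficients, and m_i is its maximum score. The empty sum for k = 0 is taken to be zero.

```latex
P(X_i = k \mid \theta)
  = \frac{\exp\left( \sum_{v=1}^{k} a_i (\theta - b_{iv}) \right)}
         {\sum_{c=0}^{m_i} \exp\left( \sum_{v=1}^{c} a_i (\theta - b_{iv}) \right)},
  \qquad k = 0, 1, \dots, m_i

\log \frac{P(X_i = k \mid \theta)}{P(X_i = k-1 \mid \theta)} = a_i (\theta - b_{ik})
```

The second line follows from the first by dividing adjacent category probabilities. A Rasch-type partial credit model would force every a_i to be equal, which is the restriction being avoided here.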

First, an inspection of the
model without further cleaning, looking specifically at the first
question of a final exam.

The scores for question 1 range from 0 to 5 in half-point increments. Nobody scored a 4.5/5. Some of these counts are small to the point of uselessness.

table(dat$Q1)

  0 0.5   1 1.5   2 2.5   3 3.5   4   5
  6   1  12   5  32  14  32  12  22   3

library(ltm)

mod = gpcm(dat)

summary(mod)

Coefficients:
$Q1
           value std.err z.value
Catgr.1    9.144   6.788   1.347
Catgr.2  -14.322   7.073  -2.025
Catgr.3    4.426   3.353   1.320
Catgr.4  -10.542   4.108  -2.566
Catgr.5    4.480   2.289   1.957
Catgr.6   -4.524   2.282  -1.982
Catgr.7    5.649   2.492   2.267
Catgr.8   -2.991   2.287  -1.308
Catgr.9   11.566   4.745   2.438
Dscrmn     0.180   0.056   3.232

If we reduce the number of categories to six by rounding each half-point score to a whole number, then only the endpoints have fewer than 10 observations.

dat = round(dat)

table(round(dat$Q1))

  0   1   2   3   4   5
  7  12  51  32  34   3
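One detail worth checking is how round() treats exact halves: R sends them to the nearest even integer rather than always up, which is why the single 0.5 joins the zeros while both 1.5 and 2.5 land on 2. The sketch below re-derives the rounded counts from the half-point table above; the vectors `scores` and `counts` are typed in by hand from that table rather than taken from the data.

```r
# Exact halves go to the nearest even integer, not always up
round(c(0.5, 1.5, 2.5, 3.5))

# Half-point scores and their counts, copied from the first table
scores = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5)
counts = c(6, 1, 12, 5, 32, 14, 32, 12, 22, 3)

# Add up the counts that share a rounded score
collapsed = tapply(counts, round(scores), sum)
collapsed
```

If rounding every half point up is what is wanted, floor(x + 0.5) does that instead, and it would give different category counts than the ones shown here.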

How does the model for
question 1 differ after rounding the scores to the nearest whole?

mod = gpcm(dat)

summary(mod)$coefficients$Q1

              value   std.err    z.value
Catgr.1 -2.25102527 1.4965783 -1.5041146
Catgr.2 -4.70417153 1.6356180 -2.8760821
Catgr.3  1.38695931 0.8272908  1.6765077
Catgr.4  0.07941232 0.7656370  0.1037206
Catgr.5  7.91144221 2.8570050  2.7691384
Dscrmn   0.32993218 0.1039574  3.1737254

The coefficients and their standard errors are smaller. Given that the small groups were removed, the smaller standard errors make sense.

That the coefficients are smaller is partly a reflection of the larger discrimination coefficient. To determine the log-odds of being in one category as opposed to an adjacent one, the discrimination coefficient is multiplied by the appropriate category coefficient. The discrimination coefficient also determines how much each unit of ability increases or decreases those same log-odds. By default, ability scores range from -4 for the weakest student in the cohort, to 0 for the average student, to +4 for the strongest student.

For example, the fitted probability of an **average** student (ability = 0) getting 5/5 on the question, assuming that they got at least 4/5, would be

1 - (exp(7.911 * 0.3300) / (1 + exp(7.911 * 0.3300))) = **0.0685**.

Compare this to the observed 3/37 = 0.0811 for the entire cohort.

Repeating this for getting a score of 4+ given that the student got 3+:

Fitted: 1 - (exp(0.079 * 0.3300) / (1 + exp(0.079 * 0.3300))) = **0.4935**.

Compare this to the observed 37/69 = 0.5362.

And for getting a score of 3+ given that the student got 2+:

Fitted: 1 - (exp(1.387 * 0.3300) / (1 + exp(1.387 * 0.3300))) = **0.3875**.

Compare this to the observed 69/120 = 0.5750.

And for getting a score of 2+ given that the student got 1+:

Fitted: 1 - (exp(-4.704 * 0.3300) / (1 + exp(-4.704 * 0.3300))) = **0.8252**.

Compare this to the observed 120/132 = 0.9091.

And finally, for getting a score of 1+, unconditional:

Fitted: 1 - (exp(-2.251 * 0.3300) / (1 + exp(-2.251 * 0.3300))) = **0.6776**.

Compare this to the observed 132/139 = 0.9496.
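One caveat on these comparisons: in the generalized partial credit model, the discrimination times a category coefficient is usually read as the log-odds of scoring k versus k-1 (adjacent categories), rather than k or more versus k-1 or more. The two readings coincide only for the top score. Under the adjacent-category reading, the matching observed quantity is the share scoring k among students who scored either k-1 or k. The sketch below computes that from the rounded counts in the table above; the variable names are illustrative, the coefficients are the rounded values used in the text, and plogis() is base R's inverse logit.

```r
# Rounded score counts for question 1 (scores 0 to 5), from the table above
counts = c(7, 12, 51, 32, 34, 3)

# Observed proportion scoring k among students who scored k-1 or k
upper = counts[2:6]   # counts for scores 1..5
lower = counts[1:5]   # counts for scores 0..4
obs_adjacent = upper / (upper + lower)

# Fitted values at ability = 0, using the coefficients quoted in the text
coefs_cat  = c(-2.251, -4.704, 1.387, 0.079, 7.911)
coefs_disc = 0.3300
fit_adjacent = plogis(coefs_disc * (0 - coefs_cat))

# Rows are the steps 0->1, 1->2, 2->3, 3->4, 4->5
round(cbind(obs_adjacent, fit_adjacent), 4)
```

With these counts the observed adjacent-category proportions are 12/19 = 0.6316, 51/63 = 0.8095, 32/83 = 0.3855, 34/66 = 0.5152, and 3/37 = 0.0811, which sit much closer to the fitted values than the "k or more" proportions do.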

These probabilities can be quickly found with...

library(faraway) # for the inverse logit function

coefs_cat = coef(mod)$Q1[1:5]
coefs_disc = coef(mod)$Q1[6]

ability = 0

prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

prob_cond

For someone with ability score +4 instead of zero, subtract 4 from each category coefficient. So for a top student, the coefficients above would be 3.911, -3.921, -2.613, -8.704, and -6.251 respectively. Their counterpart probabilities would be 0.2157, 0.7848, 0.7031, 0.9465, and 0.8872 respectively.

ability = 4

prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

prob_cond

We can use these conditional
probabilities to get marginal probabilities of score 1+, 2+, ..., and
therefore an expected score for an average student, or any student
with ability in the -4 to 4 range.

Ncat = length(coef(mod)$Q1) - 1
coefs_cat = coef(mod)$Q1[1:Ncat]
coefs_disc = coef(mod)$Q1[Ncat + 1]

ability = 0

# Unnormalized category probabilities for scores 0, 1, ..., Ncat
top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
bottom = sum(top)
prob = top / bottom

prob

E_score = sum( 0:Ncat * prob)

E_score

These calculations yield an expected score of 2.61/5 for an average student, 3.85/5 for the best student, and 1.02/5 for the worst student.

Compare the observed mean of 2.59 and range of 0 to 5 for n=131 students.
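The link between the category probabilities and the "score 1+, 2+, ..." probabilities can be checked directly: summing the probabilities of scoring at least k over k = 1, ..., 5 gives the same expected score as weighting each score by its probability. The sketch below hard-codes the question 1 coefficients from the rounded model's output so that it runs without the exam data; `prob_atleast` is a name introduced here for illustration.

```r
# Question 1 coefficients as reported above (rounded model)
coefs_cat  = c(-2.25102527, -4.70417153, 1.38695931, 0.07941232, 7.91144221)
coefs_disc = 0.32993218
ability = 0

# Category probabilities for scores 0..5
top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
prob = top / sum(top)

# P(score >= k) for k = 1..5: reverse cumulative sums, dropping k = 0
prob_atleast = rev(cumsum(rev(prob)))[-1]

# Both routes give the same expected score
E_score = sum(0:5 * prob)
c(E_score, sum(prob_atleast))
```

Changing `ability` to -4 or 4 gives the expected scores for the weakest and strongest students in the same way.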

For convenience and
scalability, we can put the expected score calculations in a
function.

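# Expected score on one question for a student of a given ability.
# coefs: category coefficients followed by the discrimination coefficient.
get.exp.score = function(coefs, ability)
{
  Ncat = length(coefs) - 1
  coefs_cat = coefs[1:Ncat]
  coefs_disc = coefs[Ncat + 1]

  top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))
  bottom = sum(top)
  prob = top / bottom

  E_score = sum( 0:Ncat * prob)
  return(E_score)
}

# Discrimination coefficient: the last element of the coefficient vector.
get.discrim = function(coefs)
{
  Ncat = length(coefs) - 1
  coefs_disc = coefs[Ncat + 1]
  return(coefs_disc)
}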

Now we repeat this process across every question. Note that the coef() function for the gpcm model returns a list of vectors, not a matrix. Therefore, to reference the kth element, we need to use [[k]].
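To see the difference between the two bracket styles before running the full loop, here is a small sketch. The list `coefs_list` is a stand-in for coef(mod): its Q1 entry uses the rounded values reported above, while the Q2 entry is made up and deliberately has fewer categories, which is the situation where a list is needed instead of a matrix.

```r
# Stand-in for coef(mod): one coefficient vector per question.
# Q1 uses the values reported above; Q2 is hypothetical.
coefs_list = list(
  Q1 = c(-2.251, -4.704, 1.387, 0.079, 7.911, 0.330),
  Q2 = c(-1.0, 0.5, 1.2)
)

class(coefs_list[1])    # single brackets return a sub-list
class(coefs_list[[1]])  # double brackets return the numeric vector itself

# The discrimination is the last element of each vector
discrim = sapply(coefs_list, function(coefs) coefs[length(coefs)])
discrim
```

The same sapply() pattern is one possible alternative to an explicit loop over questions.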

library(ltm)

mod = gpcm(dat)

Nquest = length(coef(mod))
ability_level = c(-4,-2,0,2,4)
Nability = length(ability_level)

E_scores_mat = matrix(NA,nrow=Nquest,ncol=Nability)
discrim = rep(NA,Nquest)
scoremax = apply(dat,2,max)

for(k in 1:Nquest)
{
  for(j in 1:Nability)
  {
    E_scores_mat[k,j] = get.exp.score(coef(mod)[[k]], ability_level[j])
  }

  discrim[k] = get.discrim(coef(mod)[[k]])

  ## Normalizing
  E_scores_mat[k,] = E_scores_mat[k,] / scoremax[k]
}

E_scores_mat = round(E_scores_mat,4)

q_name = paste0("FinalQ",1:Nquest)
q_info = data.frame(q_name,E_scores_mat,scoremax,discrim)
names(q_info) = c("Name",paste0("Escore_at",ability_level),"max","discrim")

q_info

        Name Escore_at-4 Escore_at-2 Escore_at0 Escore_at2 Escore_at4 max discrim
Q1   FinalQ1      0.2036      0.3572     0.5218     0.6666     0.7697   5   0.330
Q2   FinalQ2      0.0795      0.2585     0.7546     0.9789     0.9973   6   0.819
Q3   FinalQ3      0.0657      0.2916     0.6110     0.8681     0.9626   7   0.742
Q4   FinalQ4      0.0198      0.2402     0.7568     0.9281     0.9791   9   0.508
Q5   FinalQ5      0.1748      0.4496     0.5614     0.5870     0.5949   5   0.446
Q6   FinalQ6      0.1072      0.5685     0.9516     0.9927     0.9982   4   0.521
Q7   FinalQ7      0.3042      0.5379     0.7413     0.8465     0.9018   7   0.201
Q8   FinalQ8      0.1206      0.3451     0.6526     0.8945     0.9744   7   0.628
Q9   FinalQ9      0.1686      0.3031     0.4875     0.5827     0.6114   8   0.232
Q10 FinalQ10      0.1079      0.4182     0.7577     0.8466     0.8663   8   0.408
Q11 FinalQ11      0.2066      0.4157     0.5480     0.6187     0.6653   8   0.275
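One way to read this table is to order the questions by their discrimination coefficient. The sketch below copies the discrim column by hand into a named vector so that it runs on its own; with the real objects, the same ordering could be taken from q_info directly.

```r
# Discrimination values copied from the table above
discrim = c(FinalQ1 = 0.330, FinalQ2 = 0.819, FinalQ3 = 0.742, FinalQ4 = 0.508,
            FinalQ5 = 0.446, FinalQ6 = 0.521, FinalQ7 = 0.201, FinalQ8 = 0.628,
            FinalQ9 = 0.232, FinalQ10 = 0.408, FinalQ11 = 0.275)

# Questions ordered from most to least discriminating
ranked = sort(discrim, decreasing = TRUE)

names(ranked)[1:3]    # the three most discriminating questions
names(ranked)[9:11]   # the three least discriminating questions
```

By this measure questions 2, 3, and 8 separate students most sharply, while questions 11, 9, and 7 do so least.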

References:

Item Response Theory package options: https://cran.r-project.org/web/views/Psychometrics.html

Documentation on the GPCM function in the ltm package: https://www.rdocumentation.org/packages/ltm/versions/1.1-0/topics/gpcm

Original paper on the Generalized Partial Credit Model, by Eiji Muraki (1992)
