Sunday, 29 November 2015

I read this: Validity and Validation

Motivation for reading / Who should read this:
I read Validity and Validation – Understanding Statistics, by Catherine S. Taylor, in part as a follow-up to the meta-analysis book from two weeks ago, and in part because I wanted to know more about the process of validating a scale made from a series of questions (e.g. the Beck Depression Inventory, or a questionnaire about learning preferences). For scale validation, there’s a field called Item Response Theory, which I understand better because of this book, though not in any depth.

This book, a quick read at 200 small pages, complements the handbook of educational research design that Greg Hum and I made. I plan to recommend it to anyone new to conducting social science research because it provides a solid first look at the sort of issues that can prevent justifiable inferences (called “threats to internal validity”), and those that can limit the scope of the results (called “threats to external validity”).

A good pairing with “Validity and Validation” is “How to Ask Survey Questions” by Arlene Fink. My findings from that book are in this post from last year. If I were to give reading homework to consulting clients, I would frequently assign both of these books.

What I learned:
Some new vocabulary and keywords.

For investigating causality, there is a qualitative analogue to directed acyclic graphs (DAGs), called ‘nomological networks’. Nomological networks are graphs describing factors, directly or not, that contribute to a construct. A construct is like a qualitative analogue of a response variable, but has a more inclusive definition.

To paraphrase Chapter 1 of [1]: beyond statistical checks that scores from a questionnaire or test accurately measure a construct, it’s still necessary to ensure the relevance and utility of that construct.

Hierarchical linear models (HLMs) resemble random effects models, or models that use Bayesian hyperpriors. An HLM is a linear model where the regression coefficients are themselves response values to their own linear models, possibly with random effects. More than two layers are possible, in which the coefficients in each of those models could also be responses to their own models, hence the term ‘hierarchical’.
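To make the "coefficients are responses to their own models" idea concrete, here is a minimal simulation sketch of a two-level HLM. Everything in it is an assumption for illustration: the function name, the coefficient values, and the noise levels are made up, and it only generates data rather than fitting a model.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def simulate_two_level(n_groups=5, n_per_group=4):
    """Simulate y_ij = b0_j + b1_j * x_ij + e_ij, where the group's
    intercept b0_j and slope b1_j are themselves linear in a
    group-level predictor w_j, plus random effects."""
    rows = []
    for j in range(n_groups):
        w = rng.uniform(0.0, 1.0)                        # group-level predictor
        intercept = 2.0 + 1.5 * w + rng.gauss(0.0, 0.3)  # level-2 model for b0_j
        slope = 0.5 - 0.4 * w + rng.gauss(0.0, 0.1)      # level-2 model for b1_j
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)                   # case-level predictor
            y = intercept + slope * x + rng.gauss(0.0, 0.5)  # level-1 model
            rows.append((j, w, x, y))
    return rows

rows = simulate_two_level()
for group, w, x, y in rows[:4]:
    print(f"group {group}: w = {w:.2f}, x = {x:+.2f}, y = {y:.2f}")
```

A third layer would replace the fixed numbers in the level-2 lines (2.0, 1.5, 0.5, -0.4) with values drawn from yet another model.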

What is Item Response Theory?
Item response theory (IRT) is a set of methods that puts both questions/items and respondents/examinees in the same or related parameter spaces.

The simplest model is a 1-Parameter IRT model, also called a Rasch model. A Rasch model assigns a ‘difficulty’ or ‘location’ to an item based on how many respondents give a correct/high answer or an incorrect/low answer. At the same time, respondents also have a location value based on the items to which they give a correct/high response. An item that few people get correct will have a high location value, and a person who gets many items correct will also have a high location value.
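As a rough sketch of how item and person locations interact, the usual form of the Rasch model gives the probability of a correct/high response as a logistic function of (person location minus item location). The locations below are hypothetical values chosen for illustration, not estimates from any data set.

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct/high response) under a 1-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical locations on the shared scale (in logits).
easy_item, hard_item = -1.0, 2.0
weak_person, strong_person = -0.5, 1.5

for person in (weak_person, strong_person):
    for item in (easy_item, hard_item):
        p = rasch_probability(person, item)
        print(f"person {person:+.1f}, item {item:+.1f}: P(correct) = {p:.3f}")
```

When a person and an item share the same location, the probability is exactly one half, which is what it means for them to live on the same scale.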

A 2-parameter model includes a dimension for ‘discrimination’. Items with higher discrimination will elicit a greater difference in responses between respondents with a lower and those with a higher location than the item. Models with more parameters and ones for non-binary questions also exist.
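Here is a small sketch of what discrimination does, assuming the common 2-parameter logistic form in which discrimination multiplies the distance between person and item. The helper `response_gap` is a made-up name for this illustration; it compares two respondents sitting one unit above and one unit below the item.

```python
import math

def two_pl_probability(ability, difficulty, discrimination):
    """P(correct/high response) under a 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def response_gap(discrimination, difficulty=0.0, distance=1.0):
    """Difference in P(correct) between a respondent `distance` above
    the item's location and one the same distance below it."""
    high = two_pl_probability(difficulty + distance, difficulty, discrimination)
    low = two_pl_probability(difficulty - distance, difficulty, discrimination)
    return high - low

for a in (0.5, 1.0, 2.0):
    print(f"discrimination {a}: gap = {response_gap(a):.3f}")
```

Setting the discrimination to 1 for every item recovers the Rasch model.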

The WINSTEPS software package for item response theory (IRT):
WINSTEPS is a program that, when used on a data set of n cases each giving numerical responses to p items, gives an assessment of how well each item fits in with the rest of the questionnaire. It gives two statistics: INFIT and OUTFIT. OUTFIT is like a goodness-of-fit measure for extreme respondents at the tail ends of the dataset. INFIT is a goodness-of-fit measure for typical respondents.

In the language of IRT, this means INFIT is sensitive to odd patterns from respondents whose locations are near that of the item, and OUTFIT is sensitive to odd patterns from respondents with locations far from the item. Here is a page with the computations behind each statistic.
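The sketch below uses the usual mean-square definitions for a binary Rasch item: OUTFIT is the plain average of squared standardized residuals, and INFIT is the sum of squared residuals divided by the sum of the model variances. It assumes the person and item locations are already known, and I have not checked that it matches WINSTEPS output digit for digit. The abilities and response patterns are invented for illustration.

```python
import math

def rasch_probability(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one binary item."""
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # OUTFIT: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # INFIT: information-weighted, so respondents near the item dominate.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

abilities = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]  # hypothetical person locations
surprise_far = [1, 0, 0, 1, 1, 1, 1]   # lowest-ability respondent unexpectedly correct
surprise_near = [0, 0, 1, 1, 1, 1, 1]  # respondent just below the item unexpectedly correct

infit_far, outfit_far = item_fit(surprise_far, abilities, difficulty=0.0)
infit_near, outfit_near = item_fit(surprise_near, abilities, difficulty=0.0)
print(f"far surprise:  infit = {infit_far:.2f}, outfit = {outfit_far:.2f}")
print(f"near surprise: infit = {infit_near:.2f}, outfit = {outfit_near:.2f}")
```

The surprise from the far-away respondent inflates OUTFIT much more than INFIT, while the surprise close to the item's location shows up more in INFIT.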

On CRAN there is a package called Rwinsteps, which allows you to call functions in the WINSTEPS program inside R. There are many item response theory packages in R, but the more general ones appear to be “cacIRT”, “classify”, “emIRT”, “irtoys”, “irtProb”, “lordif”, “mcIRT”, “mirt”, “MLCIRTwithin”, “pcIRT”, and “sirt”.

For future reference.
Page 11 has a list of common threats to internal validity.

Pages 91-95 have a table of possible validity claims (e.g. “Scores can be used to make inferences”), which are broken down into specific arguments (e.g. “Scoring rules are applied consistently”), which in turn are broken down into tests of that argument (e.g. “Check conversion of ratings to score”).

Pages 158-160 have a table of guidelines for making surveys translatable between cultures. These are taken from the International Test Commission’s guidelines for translating and adapting tests between languages and cultures.

The last chapter is entirely suggestions for future reading. The following references stood out:

[1] (Book) Educational Measurement, 4th edition, by Brennan 2006. Especially the first chapter, by Messick

[2] (Book) Hierarchical Linear Models: Applications and Data Analysis Methods by Raudenbush and Bryk 2002.

[3] (Book) Structural Equation Modelling with EQS by Byrne 2006. (EQS is a software package)

[4] (Book) Fundamentals of Item Response Theory, by Hambleton, Swaminathan, and Rogers 1991.

[5] (Book) Experimental and Quasi-Experimental Designs for Generalized Causal Inference by Shadish, Cook, and Campbell 2002. (This is probably different from the ANOVA/factorial-heavy Design/Analysis of Experiments taught in undergrad stats.)

[6] (Journal) New Directions in Evaluation

Tuesday, 10 November 2015

I read this: Meta-Analysis, A Comparison of Approaches

My motivation for reading Meta-Analysis: A Comparison of Approaches by Ralph Schulze was to further explore the idea of a journal of replication and verification. Meta-analyses seemed like a close analogy, except that researchers are evaluating many studies together, rather than one in detail. I’m not working on any meta-analyses right now, but I may later. If you are reading this to decide if this book is right for you, consider that your motivations will differ.

Summary: ‘Meta Analysis’, about 200 pages, was easy to read for a graduate-level textbook and managed to be rigorous without overwhelming the reader with formulae.

Most of the book is dedicated to the mathematical methods of finding collective estimates of a value or set of values from related independent studies. The latter half of the book is dedicated to describing these methods in detail and to a large Monte Carlo comparison of the methods under a range of conditions. Conditions include different sample sizes per study, different numbers of studies, and different correlation coefficients of interest.

The first half was much more useful on a first reading, but the detailed descriptions and comparisons would make an excellent reference if I were preparing or performing a meta-analysis.
The ‘soft’ aspects of meta-analysis are only briefly touched upon, but several promising references are given. References on retrieval of studies (e.g. sampling, scraping, and coverage) and on assessing studies for quality and relevance include several chapters of [1], and [2].

Take-home lessons (i.e. what I learned):
The most common method to get a collective estimate of a parameter is to take a weighted average of estimates from independent studies, with weights inversely proportional to the variance of each estimate. This method makes a very questionable assumption: that all papers studied are estimating the same parameter.
The author calls this assumption the fixed effect model. Some of the methods described explicitly use this model, but all of the methods include a test (usually using the chi-squared distribution) to detect if a fixed effect model is inappropriate.
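Here is a minimal sketch of inverse-variance pooling under the fixed effect model, along with Cochran's Q, which is one common chi-squared-based homogeneity statistic. I am assuming Q is representative of the tests in the book; the study estimates and variances are invented numbers.

```python
def fixed_effect_pool(estimates, variances):
    """Inverse-variance weighted average of study estimates.

    Returns (pooled estimate, variance of the pooled estimate, Q).
    Under the fixed effect model, Q is approximately chi-squared
    with (number of studies - 1) degrees of freedom."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_variance = 1.0 / total_weight
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_variance, q

# Hypothetical estimates and their variances from four studies.
estimates = [0.30, 0.25, 0.40, 0.10]
variances = [0.010, 0.020, 0.015, 0.005]

pooled, pooled_variance, q = fixed_effect_pool(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, variance = {pooled_variance:.4f}")
print(f"Q = {q:.2f} on {len(estimates) - 1} degrees of freedom")
```

A Q that is large relative to its degrees of freedom is the signal that the studies are probably not all estimating the same parameter.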

Other methods, such as Olkin and Pratt, and DerSimonian-Laird, use the more complex but more realistic random effects model. Under this model, the parameters that the studies are estimating are related, but slightly different from each other. The collective estimate that comes out of the meta-analysis is then an estimate of a parameter one layer of abstraction above the parameters described in each individual study.
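For comparison, here is a sketch of the DerSimonian-Laird approach as it is usually written: estimate the between-study variance (tau squared) by the method of moments from Q, then redo the inverse-variance weighting with that extra variance added to every study. The input numbers are invented, and this is the textbook form rather than anything copied from ‘Meta Analysis’.

```python
def dersimonian_laird(estimates, variances):
    """Random effects pooling with the DerSimonian-Laird estimator.

    Returns (pooled estimate, variance of the pooled estimate, tau squared)."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    fixed = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    q = sum(w * (e - fixed) ** 2 for w, e in zip(weights, estimates))
    # Method-of-moments estimate of between-study variance, floored at zero.
    scale = total_weight - sum(w ** 2 for w in weights) / total_weight
    tau_squared = max(0.0, (q - (k - 1)) / scale)
    # Re-weight with the between-study variance added to each study.
    re_weights = [1.0 / (v + tau_squared) for v in variances]
    pooled = sum(w * e for w, e in zip(re_weights, estimates)) / sum(re_weights)
    return pooled, 1.0 / sum(re_weights), tau_squared

# Hypothetical, visibly heterogeneous study estimates.
estimates = [0.10, 0.50, 0.30, 0.70]
variances = [0.010, 0.020, 0.015, 0.005]

pooled, pooled_variance, tau_squared = dersimonian_laird(estimates, variances)
print(f"pooled = {pooled:.3f}, variance = {pooled_variance:.4f}, tau^2 = {tau_squared:.4f}")
```

When the studies agree closely, tau squared is floored at zero and the result collapses back to the fixed effect estimate.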

There are other, yet more complex models that are viable, such as mixture models or hierarchical linear models in which each study’s parameter estimate is an estimate of some combination of abstract parameters, but these are only briefly covered in ‘Meta Analysis’.

Many of the methods described used Fisher’s z transformation in some way, where

z = (1/2) * ln( (1 + r) / (1 - r) ),

which is a pretty simple transformation for Pearson correlation coefficients r that maps from (-1, 1) to (-infty, +infty), converges to normality way faster than r does, and has an approximate variance that only depends on the sample size n. (Found on pages 22-23).
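A short sketch of the transformation in use: build a confidence interval on the z scale and map it back to the correlation scale. The approximate variance 1/(n - 3) is the standard result; I'm stating it from memory rather than quoting the book, and the example values of r and n are arbitrary.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def inverse_fisher_z(z):
    """Map a z value back to the correlation scale."""
    return math.tanh(z)

def correlation_interval(r, n, critical=1.96):
    """Approximate 95% interval for a correlation, built on the z scale,
    where the standard error is roughly 1 / sqrt(n - 3)."""
    z = fisher_z(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    lower = inverse_fisher_z(z - critical * standard_error)
    upper = inverse_fisher_z(z + critical * standard_error)
    return lower, upper

lower, upper = correlation_interval(r=0.5, n=50)
print(f"z = {fisher_z(0.5):.4f}, interval for r = ({lower:.3f}, {upper:.3f})")
```

The interval is not symmetric around r, which is part of the point: the symmetry lives on the z scale.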

Also, apparently transforming effect sizes into correlations by treating the treatment group as a continuous variable coded 0 or 1 isn’t overly problematic (pages 30-32). This can be very useful for bringing in a wider range of studies when a collective correlation coefficient is desired.
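To see what "treating treatment group as a 0/1 variable" amounts to, the sketch below computes a Pearson correlation between a group indicator and the outcome, and checks it against the standard conversion from a standardized mean difference (Cohen's d with a pooled standard deviation). The conversion formula is the usual point-biserial identity, not something taken from the book, and the data are invented.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def cohens_d(treated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    m1, m2 = sum(treated) / n1, sum(control) / n2
    ss1 = sum((y - m1) ** 2 for y in treated)
    ss2 = sum((y - m2) ** 2 for y in control)
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def d_to_r(d, n1, n2):
    """Convert d to the correlation between outcome and a 0/1 group code."""
    total = n1 + n2
    return d / math.sqrt(d ** 2 + total * (total - 2) / (n1 * n2))

# Hypothetical outcomes for a small two-group study.
control = [4.1, 5.0, 3.8, 4.6, 5.2]
treated = [5.9, 6.3, 5.1, 6.8, 5.5, 6.0]

group = [0] * len(control) + [1] * len(treated)  # treatment coded 0 or 1
outcome = control + treated

r_direct = pearson_r(group, outcome)
r_from_d = d_to_r(cohens_d(treated, control), len(treated), len(control))
print(f"r from 0/1 coding = {r_direct:.4f}, r converted from d = {r_from_d:.4f}")
```

The two numbers agree, so a study that only reports group means and standard deviations can still contribute a correlation to the pooled estimate.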

I didn't find any clear beacon that said "this is where replication work is published", but I found the following promising leads:

[1] The Handbook of Research Synthesis (1994)
[2] Chalmers et al. (1981) “A method for assessing the quality of a randomized control trial.” Controlled Clinical Trials, volume 2, pages 31-49.
[3] Quality & Quantity: International Journal of Methodology
[4] Educational and Psychological Measurement (Journal)
[5] International Journal of Selection and Assessment
[6] Validity Generalization (Book)
[7] Combining Information: Statistical Issues and Opportunities for Research (Book)