Monday, 9 February 2015

Package spotlight - quantreg

Linear regression and its extensions are great for cases when you wish to model the mean of values Y in response to some predictors X.

Sometimes, however, you may be interested in the median response, or something more general like the cutoff for the top 20 percent of the responses for a given set of predictors. To model such cutoffs, quantile regression excels, and the quantreg package makes it feel a lot like linear regression.

This spotlight includes a simple bivariate normal example, then moves on to a more complex applciation that could be used to track player progression in a video game (e.g. a future digitization of Kudzu). First, we generate the bivarate normals, where y is the response to x with normal variation, and x is the predictor.

set.seed(12345)
x = rnorm(1000)
y = x + rnorm(1000)


Since the normal distribution is symmetric, the linear regression and quantile regression about the median should both approximate a slope of 1 and intercept of 0.

lm(y ~ x)$coef
(Intercept)           x 
-0.03220106  1.03914825 

library(quantreg)
rq(y ~ x)$coef
(Intercept)           x 
-0.02357192  0.98935700 

Furthermore, for other quantile regressions of this dataset, we would expect the slope to remain near 1 and the intercept to approximate the univariate normal quantiles. We expect this because the random variation in y is known to be Norm(0,1).

### 25th and 75th percentiles
rq(y ~ x, tau=c(.25,.75))
            tau= 0.25 tau= 0.75
(Intercept) -0.729756 0.6756721
x            1.042261 1.0169105

qnorm(c(.25,.75))
 -0.6744898  0.6744898

 
### 70th, 90th, 95th, 97.5th, 99.5th percentiles
rq(y ~ x, tau=c(.7,.9,.95,.975,.995))
            tau= 0.700 tau= 0.900 tau= 0.950 tau= 0.975 tau= 0.995
(Intercept)  0.5099463   1.298006   1.545429   1.833121  2.5811008
x            1.0036638   1.043121   1.023059   1.010482  0.8197345

qnorm( c(.7,.9,.95,.975,.995))
0.5244005 1.2815516 1.6448536 1.9599640 2.5758293


As you might expect. The model behaves unpredicably for extreme quantiles because useful data grows sparse. Finally, we can draw all these quantile regression lines.

 mod = rq(y ~ x, tau=c(.25,.5,.7,.9,.95,.975,.995))
 coefs = mod$coef
 plot(y ~ x)
 abline(h=0)
 abline(v=0)
 for(k in 1:ncol(coefs))  {   abline(coefs[,k], col="Red", lwd=2) }
 


For this next example, we are interested in estimating how many players can reach an arbitrary score in a game after playing a certain amount of time. This sort of information is valueable when tuning the difficulty of "reach a score of Y" challenge in a game so that few players can reach the goal right away but most players can after some effort. Consider the best score a player reaches in a game as the response Y and time spent on a challenge as a predictor X.

One problem: scores in games do not always increase linearly or in a simple transformation of linear. To handle this, the quantreg package also includes nlrq(), a non-linear quantile regression function. As before, we generate some player data.

#### Make a database of simulated users for our game
set.seed(12345)
Nusers = 1000

## An average of 20 minutes played, capped at 100 minutes
minutes = rexp(Nusers, rate=1/20) 
minutes = pmin(minutes,100) #pmin = parallel min

## performance is a function of skill (gamma), luck (norm) and time
score = rnorm(Nusers, mean=100,sd=100) + 
  10*minutes + 0.25*minutes^2 

gamedata = data.frame(score,minutes)  

Like the non-linear least squares function nls() in base R, nlrq() needs a formula with parameters on which to find an optimum fit. Starting values for the parameters are also needed. The formula can be user-defined function, such as the cubic polynomial function shown here:

### Polynomial degree 3 function
polyd3 = function(x,a,b,c)
{
 output = a + b*x + c*x^2
 return(output)
}


Now, using a flat line at zero our initial curve, we can build a model.

init = list(a=0,b=0,c=0)
mod = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.5)
coefs = summary(mod)$coef
coefs
        Value  Std. Error  t value Pr(>|t|)
a 103.7766248 7.381757743 14.05852        0
b   9.8370123 0.559544019 17.58041        0
c   0.2525046 0.007178738 35.17396        0

coefs[,1]
          a           b           c 
103.7766248   9.8370123   0.2525046 

If you look at the score formula we used in the simulation, the parameters of 100, 10, and 0.25 were all estimated fairly well. Also, the standard error estimates are reasonable, if not a bit conservative in this instance. The nlrq() function depends on summary() for most of the information, which is why the extra steps involving summary are shown here. We can use the formula function defined earlier to make predictions, and the delta method to get standard error of those predictions (delta method not shown).

 ### Median score after 15 minutes
 polyd3(x=15,coefs['a',1],coefs['b',1],coefs['c',1])
 308.1454
 
 
 ### Median score after 100 minutes
 polyd3(x=100,coefs['a',1],coefs['b',1],coefs['c',1])
 3612.524
 


Finally, we can also use the polyd3() and curve() to draw quantile curves through the data.

 attach(gamedata)
 plot(score ~ minutes, xlab="Time Played (minutes)", 
  ylab="Best Score", cex=0.6) #cex = Character EXpansion
 
 
 mod.10 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.1)
 coefs = summary(mod.10)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=T,lwd=1)
 
 
 mod.50 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.5)
 coefs = summary(mod.50)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=T,lwd=1)
 
 
 mod.90 = nlrq(score ~ polyd3(minutes,a,b,c),start=init,tau=0.9)
 coefs = summary(mod.90)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=T,lwd=1)