## Saturday, 29 December 2018

### Degrees of Freedom, Explained

You can interpret degrees of freedom, or DF, as the number of (new) pieces of information that go into a statistic. This interpretation is illustrated with examples in this video: [https://www.youtube.com/watch?v=rATNoxKg1yA , James Gilbert, “What are degrees of freedom”]

I personally prefer to think of DF as a kind of statistical currency. You earn it by taking independent sample units, and you spend it on estimating population parameters or on information required to compute test statistics.

In this article, degrees of freedom are explained through these lenses using some common hypothesis tests, followed by selected topics such as saturation, fractional DF, and mixed-effects models.

#### Spending DF, T-Tests

Taking the mean and standard deviation from a sample of size N from a single population, we start with N DF, and 'spend' 1 of them on estimating the mean, which is necessary for calculating the standard deviation.

$$ s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} $$

The remaining N-1 can be 'spent' on estimating the standard deviation.
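As a minimal sketch of this bookkeeping (the function name `sample_sd` and the data values are made up for illustration), the standard deviation can be computed by hand and checked against Python's `statistics.stdev`. The deviations from the mean always sum to zero, which is one way to see that only N-1 of them are free to vary.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with N - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n                      # spends 1 DF on the mean
    squared_devs = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_devs / (n - 1))    # the remaining N - 1 DF

# Illustrative values only, not real data.
data = [4.0, 7.0, 6.0, 3.0, 5.0]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

print("hand-rolled sd:", sample_sd(data))
print("statistics.stdev:", statistics.stdev(data))
# Once N - 1 deviations are known, the last one is forced,
# because the deviations must add up to zero.
print("sum of deviations:", sum(deviations))
```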

In a two-sample t-test setting, you need to estimate the difference (or, more generally, a contrast) between the means of two different populations. This test uses samples of size N1 and N2 from these two populations, respectively. That implies that you have N1 + N2 degrees of freedom, and that you spend 2 of them estimating the 2 means. The remaining N1 + N2 - 2 can be used on estimating the uncertainty. How that N1 + N2 - 2 is spent depends on your assumptions about the variance. If you assume that both groups have the same variance, then you can spend all (N1 + N2 - 2) DF on estimating that one ‘pooled’ variance.
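The equal-variance case can be sketched as follows. The helper name `pooled_variance` and the sample values are assumptions for illustration; the pooled estimate is the usual DF-weighted average of the two sample variances.

```python
import statistics

def pooled_variance(sample1, sample2):
    """Pooled variance under the equal-variance assumption, plus its DF."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)   # divides by n1 - 1
    var2 = statistics.variance(sample2)   # divides by n2 - 1
    df = n1 + n2 - 2                      # N1 + N2 earned, 2 spent on the means
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    return pooled, df

# Illustrative values only.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0, 6.0, 8.0]
pooled, pooled_df = pooled_variance(group_a, group_b)
print("pooled variance:", pooled, "on", pooled_df, "DF")
```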

If you do not assume equal variance between the two populations, you need to spend (N1 - 1) of the DF on estimating the standard deviation of population 1, and (N2 - 1) on estimating the standard deviation of population 2. We’re still estimating a single contrast between the population means, and we need to apply a single t-distribution to the contrast.

How much we know about the standard deviation of this contrast depends on how much information we have about each of the two standard deviations. If we don't have a computer on hand, we can rely on the worst-case scenario, in which we know only as much as the smaller of the two samples tells us, that is, min(N1 - 1, N2 - 1) DF. More commonly, we calculate a 'DF equivalent' (the Welch-Satterthwaite approximation) based on how close the two variance estimates are. The closer the estimates are, the closer to the ideal (N1 + N2 - 2) DF we assume that we have.
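One common way to compute such a 'DF equivalent' is the Welch-Satterthwaite formula. The sketch below assumes that formula is the one intended here; the function name `welch_df` and the samples are made up for illustration. The result always lands between the worst case, min(N1 - 1, N2 - 1), and the pooled ideal, N1 + N2 - 2.

```python
import statistics

def welch_df(sample1, sample2):
    """Welch-Satterthwaite approximate DF for an unequal-variance t-test."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # squared standard error, group 1
    v2 = statistics.variance(sample2) / n2   # squared standard error, group 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Illustrative values only.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b_similar = [11.0, 12.0, 13.0, 14.0, 15.0]     # same spread as a
b_different = [10.0, 20.0, 30.0, 40.0, 50.0]   # much larger spread than a

df_similar = welch_df(a, b_similar)
df_different = welch_df(a, b_different)
print("similar variances:", df_similar)       # close to N1 + N2 - 2
print("different variances:", df_different)   # close to min(N1 - 1, N2 - 1)
```

With equal sample sizes and equal sample variances, the formula recovers the full N1 + N2 - 2; as the variances drift apart, the result falls toward the conservative bound and is generally not a whole number.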

#### Spending DF, ANOVA

In a One-Way ANOVA setting for k means, we have samples of size N1, N2, ... , Nk from each of k populations, respectively. That implies that we have (N1 + N2 + ... + Nk) DF to work with. Let's call that N DF for simplicity.

A One-Way ANOVA is a comparison of the group means to the grand mean (mean of ALL observations). So we need 1 DF for the grand mean, and (k-1) DF for the k group means. Why k-1? Because the last group mean can be estimated from the other groups and the grand mean. In other words, we get it 'for free'. These (k-1) DF are spent on measuring the standard deviation BETWEEN the groups.

That leaves (N-k) DF for estimating the standard deviation WITHIN each group. Note that ANOVA requires the assumption that all the groups have equal variance, such that we use all the remaining degrees of freedom to estimate that collective standard deviation.
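The DF budget for a one-way ANOVA can be sketched in a few lines. The function name `one_way_anova` and the three small groups are illustrative assumptions; the F statistic is the usual ratio of the between-group mean square to the within-group mean square.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total     # 1 DF
    group_means = [sum(g) / len(g) for g in groups]

    # Spread of the group means around the grand mean: k - 1 DF.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Spread of observations around their own group mean: N - k DF.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat

# Illustrative values only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
df_between, df_within, f_stat = one_way_anova(groups)
n_total = sum(len(g) for g in groups)
print("DF between:", df_between, "DF within:", df_within, "F:", f_stat)
```

Note that 1 (grand mean) + (k - 1) + (N - k) adds back up to N, so every DF earned is accounted for.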

#### Spending DF, Regression

In a simple linear regression setting, we have N independent observations, and each observation has two values in an (x,y) pair. We need to estimate the slope and the intercept, so that's 1 DF each, or 2 DF total. That leaves (N-2) DF for estimating the uncertainty.
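As a sketch of where the (N-2) shows up in practice, the residual standard error of a least-squares line divides the residual sum of squares by N-2 rather than N. The function name `simple_linear_regression` and the data are assumptions for illustration.

```python
import math

def simple_linear_regression(xs, ys):
    """Least-squares line; returns slope, intercept, residual DF, residual SE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # 1 DF
    intercept = y_bar - slope * x_bar      # 1 DF
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    df_resid = n - 2                       # what is left for uncertainty
    resid_se = math.sqrt(sum(r ** 2 for r in residuals) / df_resid)
    return slope, intercept, df_resid, resid_se

# Illustrative values only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, df_resid, resid_se = simple_linear_regression(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("residual DF:", df_resid, "residual SE:", resid_se)
```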

With linear regression, we also have a nice geometric interpretation of DF. A line can always be fit through two points. If we have N points, then we use 2 of them to fit a line, and the remaining N-2 points represent random noise.

With multiple regression, we have p ‘slope’ parameters and a sample of N. In this case, we start with N DF, spend 1 DF on the intercept, and p DF on the slopes, leaving us with (N - p - 1) DF to estimate uncertainty.

#### Spending DF, Chi-Squared Tests

With t-tests, ANOVA, and regression, we are essentially finding the degrees of freedom to use as parameters of a t-distribution or an F-distribution. Also, the observed responses (y variables) in these cases are composed of continuous, numeric values. When the responses are categorical, the situation is radically different.

There are two commonly used tests conducted on categorical variables using the chi-squared statistic: Goodness-of-fit tests and independence tests, also called one-way and two-way chi-squared tests, respectively. Both of these tests are calculated by finding the expected number of responses for each category, and comparing them to the observed responses:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

For the one-way / goodness-of-fit test, we have one categorical variable of C categories. The total of the observed counts O and the expected counts E both need to add up to the sample total of observations N. We need the total N in order to find the expected counts E, just like how we need the mean x-bar in order to find the sample standard deviation s in the numerical case.

As such, once you have the total and O and E for the first C-1 categories, you automatically have it for the last category. Analogously to the standard deviation situation, this means we have C-1 degrees of freedom in a one-way chi-squared test.

We have C categories with numbers in them, but we need to spend 1 DF on finding the total, leaving (C – 1) DF for estimating uncertainty.
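A goodness-of-fit calculation can be sketched like this. The function name `goodness_of_fit`, the observed counts, and the hypothesized proportions are all made up for illustration.

```python
def goodness_of_fit(observed, expected_props):
    """Chi-squared statistic and DF for a one-way (goodness-of-fit) test."""
    n = sum(observed)                              # the total costs 1 DF
    expected = [n * p for p in expected_props]     # E needs the total N
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                         # C - 1
    return chi_sq, df

# Illustrative counts: 100 observations over 4 categories,
# tested against equal proportions.
observed = [18, 22, 20, 40]
chi_sq, gof_df = goodness_of_fit(observed, [0.25, 0.25, 0.25, 0.25])
print("chi-squared:", chi_sq, "on", gof_df, "DF")
```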

For the two-way / independence test, we have two categorical variables, one with C ‘column’ categories and one with R ‘row’ categories. That implies that there are C*R combinations of categories. The expected counts for each combination, or cell, are computed from the R row totals and the C column totals. There is a bit of redundancy, because the row totals and the column totals both add up to the same grand total, so that’s actually R + C – 1 independent pieces of information.
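The expected counts and the DF bookkeeping for a two-way table can be sketched as follows. The function name `independence_test` and the 2-by-3 table are illustrative assumptions; each expected count is (row total × column total) / N, and the DF is the number of cells minus the R + C - 1 independent totals.

```python
def independence_test(table):
    """Chi-squared statistic, DF, and expected counts for an R x C table."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count in each cell under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi_sq = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(n_rows) for j in range(n_cols))

    info_spent = n_rows + n_cols - 1         # independent row/column totals
    df = n_rows * n_cols - info_spent        # equals (R - 1) * (C - 1)
    return chi_sq, df, expected

# Illustrative counts only.
table = [[10, 20, 30],
         [20, 20, 20]]
chi_sq_2way, df_2way, expected = independence_test(table)
print("chi-squared:", chi_sq_2way, "on", df_2way, "DF")
```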

We have C*R cells of information, but for doing a test of independence, we need R + C – 1 pieces of information from the row and column totals. That leaves (C*R – R – C + 1), or (C-1)*(R-1), degrees of freedom to spend on uncertainty quantification.

Haven't quite had your fill? Here's some theory.

#### Fractional Degrees of Freedom

One particularly thorny notion about equivalent degrees of freedom is that we can end up working with a number of degrees of freedom that is not a whole number. Given that each independent data point yields 1 DF, that's a little bizarre.

First, when we calculate equivalent degrees of freedom, we are sometimes using them for something that is a composite of two or more measures. The contrast (e.g. difference) between the two means in the two-sample t-test, for example, involves calculating two different standard deviations, so we're already straying from that idea of 'the amount of information going into a single estimate'.

Second, that word INDEPENDENT is a big one. If we have 10 completely independent observations, then we have a sample of size N=10. But if those observations are correlated in some way (e.g. in a time series, like the day-to-day average temperature), then each new recorded number isn't giving as much information as a completely independent observation. In cases like this, we sometimes calculate an 'effective sample size', which would be somewhere between 1 and N, depending on how correlated the observations are. That effective sample size doesn't have to be a whole number, so neither do the degrees of freedom calculations that are derived from it. (For more on effective sample size, see pseudoreplication.)
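To make the idea concrete, here is a rough sketch using one common large-sample approximation, which is an assumption on my part rather than the only definition: if observations follow a first-order autoregressive pattern with lag-1 correlation rho between 0 and 1, the effective sample size for estimating the mean is roughly N(1 - rho)/(1 + rho). The function name `effective_sample_size` and the numbers are hypothetical.

```python
def effective_sample_size(n, rho):
    """Approximate effective sample size for the mean of an AR(1) series.

    Assumes lag-1 autocorrelation rho with 0 <= rho < 1 and a large n;
    for short series or other correlation structures this is only a rough guide.
    """
    if not 0 <= rho < 1:
        raise ValueError("this sketch only handles 0 <= rho < 1")
    return n * (1 - rho) / (1 + rho)

# Hypothetical example: a year of daily temperatures.
n_days = 365
ess_independent = effective_sample_size(n_days, 0.0)   # no correlation
ess_correlated = effective_sample_size(n_days, 0.8)    # strong day-to-day correlation
print("independent days:", ess_independent)
print("correlated days:", ess_correlated)
```

With no correlation the full N is recovered; with strong correlation, a year of daily readings carries about as much information as a few dozen independent ones, and the result is not a whole number.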

Third, mathematically, there often isn't a problem with using a non-whole number of degrees of freedom. Both the t-distribution and the chi-squared distribution work just as well with DF = 3.5 as they do with DF = 3 or DF = 4.
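This is easy to check directly, because the t-distribution's density is written in terms of the gamma function, which accepts non-integer arguments. The sketch below writes out the standard density formula by hand; the function name `t_pdf` is made up for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution; df may be any positive number."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

# The density at zero rises smoothly as DF increases, with no special
# behaviour at whole numbers.
peak_3 = t_pdf(0.0, 3)
peak_3_5 = t_pdf(0.0, 3.5)
peak_4 = t_pdf(0.0, 4)
print(peak_3, peak_3_5, peak_4)
```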

#### Saturation, DF Bankruptcy

If we ever have 0 DF left over after estimating all the means, slope parameters, or any other parameters, then we have what's called a saturated model. In chemistry, a saturated solution is one that is holding all the dissolved material that it can. A saturated model is one that is estimating all the parameters that it can. There is nothing left to measure uncertainty in those estimates.

For a saturated ANOVA, we can estimate each of the group means, but we have no way of knowing how good those estimates are. For a saturated regression, we can get the intercept and the slope, but we have no way of knowing how uncertain we should be about those estimates.

In a saturated model, things like confidence intervals, standard errors, and p-values are impossible to obtain.
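The smallest saturated regression is a line fit to exactly two points. In the sketch below (the function name `fit_line` and the two points are made up for illustration), the line passes through both points, the residuals are zero, and the residual variance works out to 0 divided by 0 DF, which cannot be computed.

```python
def fit_line(xs, ys):
    """Least-squares line; returns slope, intercept, residual SS, residual DF."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, ss_resid, n - 2

# Two points, two parameters: every DF is spent on the fit.
sat_slope, sat_intercept, ss_resid, sat_df = fit_line([1.0, 3.0], [2.0, 8.0])

try:
    resid_variance = ss_resid / sat_df
except ZeroDivisionError:
    resid_variance = None   # no DF left, so no uncertainty estimate

print("slope:", sat_slope, "intercept:", sat_intercept)
print("residual DF:", sat_df, "residual variance:", resid_variance)
```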

One common solution to saturation is to impose additional assumptions or restrictions on the model. In an ANOVA, we might use a fractional factorial design and not bother to estimate certain higher-order interactions. In a regression, we might treat a set of group effects as random effects, and not consider them when trying to fit the line of best fit.

The fractional factorial case applies to multi-way ANOVAs, not one-way ANOVAs, and the solution is simply to assume that some higher-order interactions are zero. If you assume they are zero, you don't need to estimate them.

The LASSO, a regression-like method that can handle situations where the number of possible parameters p is greater than the sample size N, works on a similar principle: it assumes that most of those possible parameters are zero, thus saving the degrees of freedom necessary to estimate them.

#### Mixed-Effects and REML

For the regression case without random effects, the slopes are traditionally estimated using a method based on maximum likelihood, or ML. In lay terms, ML asks: "given the data that we observe, what are the parameter values that would have the highest chance of producing data like this?"

When we introduce random effects, REML is used instead, which is short for Restricted (or Residual) Maximum Likelihood. In this case, we only estimate the non-random effects (that is, the fixed effects, the ones we actually care about) using maximum likelihood, and then assign the random effects as after-the-fact adjustments to our predictions. By not using the random effects in fitting the model, we don't need to spend any degrees of freedom to estimate them, and we can save those degrees of freedom for estimating uncertainty instead, thus either preventing saturation or giving better confidence intervals, standard errors, and p-values. The trade-off is that we still have no uncertainty measures for the random effects, but that's an acceptable issue in many cases.