I personally prefer
to think of degrees of freedom (DF) as a kind of statistical currency.
You earn it by taking independent sample units, and you spend it on
estimating population parameters or on the information required to
compute test statistics.

In this article,
degrees of freedom are explained through this lens using some common
hypothesis tests, with selected topics like saturation, fractional DF,
and mixed-effects models at the end.

#### Spending DF, T-Tests

Taking the mean and
standard deviation from a sample of size N from a single population,
we start with N DF, and 'spend' 1 of them on estimating the mean,
which is necessary for calculating the standard deviation.

$$ s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} $$

The remaining N-1
can be 'spent' on estimating the standard deviation.
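As a rough sketch of that bookkeeping, here is the same calculation in plain Python. The numbers in `data` are made up for illustration. Note that the deviations from the sample mean always sum to zero, which is why only N-1 of them are free to vary.

```python
import math

def sample_sd(values):
    """Sample standard deviation, dividing by the N - 1 DF left after the mean."""
    n = len(values)
    mean = sum(values) / n                       # spends 1 DF
    deviations = [x - mean for x in values]      # these always sum to (about) zero
    ss = sum(d ** 2 for d in deviations)
    return math.sqrt(ss / (n - 1)), deviations   # N - 1 DF remain for the spread

data = [4.0, 7.0, 6.0, 3.0, 5.0]                 # made-up sample, N = 5
s, deviations = sample_sd(data)
print("DF for the standard deviation:", len(data) - 1)
print("sum of deviations:", sum(deviations))
print("s =", s)
```

Once you know N-1 of the deviations, the last one is forced, so it carries no new information about the spread.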

In a two sample
t-test setting, you need to estimate the difference (or, more
generally, a contrast), between the means of two different
populations. This test uses samples of size N1 and N2 from these two
populations respectively. That implies that you have N1 + N2 degrees
of freedom, and that you spend 2 of them estimating the 2 means. The
remaining N1 + N2 - 2 can be used on estimating the uncertainty. How
that N1 + N2 - 2 is spent depends on your assumptions about the
variance. If you assume that both groups have the same variance, then
you can spend all (N1 + N2 - 2) DF on estimating that one ‘pooled’
variance.
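Here is a minimal sketch of the equal-variance case, using only the standard library. The two samples and the function name `pooled_t` are made up for illustration; the point is that both sums of squares are divided by the single pool of N1 + N2 - 2 DF.

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Two-sample t statistic assuming equal variances in both populations."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)  # spends 2 DF
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    df = n1 + n2 - 2                       # everything left goes to one variance
    pooled_var = (ss1 + ss2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df, pooled_var

group_a = [5.1, 4.9, 6.2, 5.8, 5.5]        # made-up data, N1 = 5
group_b = [4.2, 4.8, 5.0, 4.4]             # made-up data, N2 = 4
t_stat, df, pooled_var = pooled_t(group_a, group_b)
print("t =", t_stat, "on", df, "DF")
```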

If you do not assume
equal variance between the two populations, you need to spend (N1 - 1)
of the DF on estimating the standard deviation of population 1, and
(N2 - 1) on estimating the standard deviation of population 2. We’re
still estimating a single contrast between the population means, and
we need to apply a single t-distribution to the contrast.

How much we know
about the standard deviation of this contrast depends on how much
information we have about each of the two standard deviations. If we
don't have a computer on hand, we can rely on the worst-case
scenario, which is that we know only as much as what we know from
the smaller of the two samples, that is min(N1 - 1, N2 - 1) DF. More
commonly, we calculate a 'DF equivalent' (the Welch-Satterthwaite
approximation) from the two variance estimates and sample sizes. With
similar sample sizes, the closer the estimates are, the closer to
the ideal (N1 + N2 - 2) DF we assume that we have.
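A common formula for this 'DF equivalent' is the Welch-Satterthwaite approximation. The sketch below assumes that is the formula in use; the function name `welch_df` and the example variances are made up for illustration.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximate DF for an unequal-variance t-test."""
    a = var1 / n1          # squared standard error contributed by sample 1
    b = var2 / n2          # squared standard error contributed by sample 2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Equal sample sizes, equal variance estimates: we reach N1 + N2 - 2.
df_equal = welch_df(4.0, 10, 4.0, 10)

# Equal sample sizes, very different variance estimates: we drift
# down toward the worst case min(N1 - 1, N2 - 1).
df_unequal = welch_df(1.0, 10, 25.0, 10)

print(df_equal, df_unequal)
```

The second result is not a whole number, which is where fractional DF first show up in practice.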

#### Spending DF, ANOVA

In a One-Way ANOVA
setting for k means, we have samples of size N1, N2, ... , Nk from
each of k populations, respectively. That implies that we have (N1 +
N2 + ... + Nk) DF to work with. Let's call that N DF for simplicity.

A One-Way ANOVA is a
comparison of the group means to the grand mean (mean of ALL
observations). So we need 1 DF for the grand mean, and (k-1) DF for
the k group means. Why k-1? Because the last group mean can be
estimated from the other groups and the grand mean. In other words,
we get it 'for free'. These (k-1) DF are spent on measuring the
standard deviation BETWEEN the groups.

That leaves (N-k) DF
for estimating the standard deviation WITHIN each group. Note that
ANOVA requires the assumption that all the groups have equal
variance, such that we use all the remaining degrees of freedom to
estimate that collective standard deviation.
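To make the DF budget concrete, here is a small sketch of a one-way ANOVA in plain Python. The three groups are made-up numbers and `one_way_anova` is just an illustrative name. The budget is 1 DF for the grand mean, k-1 DF between groups, and N-k DF within groups.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total       # 1 DF
    group_means = [sum(g) / len(g) for g in groups]          # k - 1 more DF
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k                                  # what is left over
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]                 # made-up data
df_between, df_within, f_stat = one_way_anova(groups)
print(df_between, df_within, f_stat)
```

The F statistic is then compared against an F-distribution with (k-1, N-k) DF.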

#### Spending DF, Regression

In a simple linear
regression setting, we have N independent observations, and each
observation has two values in an (x,y) pair. We need to estimate the
slope and the intercept, so that's 1 DF each, or 2 DF total. That
leaves (N-2) DF for estimating the uncertainty.

With linear
regression, we also have a nice geometric interpretation of DF. A
line can always be fit through two points. If we have N points, then
we use 2 of them to fit a line, and the remaining N-2 points
represent random noise.

With multiple
regression, we have p ‘slope’ parameters and a sample of N. In
this case, we start with N DF, spend 1 DF on the intercept, and p DF
on the slopes, leaving us with (N - p - 1) DF to estimate
uncertainty.
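The simple (one slope) case can be sketched in a few lines of plain Python. The data and the function name `simple_regression` are made up for illustration; the thing to watch is that the residual sum of squares is divided by N - 2, not N.

```python
import math

def simple_regression(xs, ys):
    """Least-squares line plus the residual standard error on N - 2 DF."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                          # 1 DF
    intercept = y_bar - slope * x_bar          # 1 DF
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    df_resid = n - 2                           # general case: n - p - 1
    resid_se = math.sqrt(rss / df_resid)
    return slope, intercept, df_resid, resid_se

xs = [1, 2, 3, 4, 5]                           # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, df_resid, resid_se = simple_regression(xs, ys)
print(slope, intercept, df_resid, resid_se)
```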

#### Spending DF, Chi-Squared Tests

With
t-tests, ANOVA, and regression, we are essentially finding the
degrees of freedom to use as parameters of a t-distribution or an
F-distribution. Also, the observed responses (y variables) in these
cases are composed of continuous, numeric values. When the responses
are categorical, the situation is radically different.

There
are two commonly used tests conducted on categorical variables using
the chi-squared statistic: Goodness-of-fit tests and independence
tests, also called one-way and two-way chi-squared tests,
respectively. Both of these tests are calculated by finding the
expected number of responses for each category, and comparing them to
the observed responses:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

For
the one-way / goodness-of-fit test, we have one categorical variable
of C categories. The total of the observed counts O and the expected
counts E both need to add up to the sample total of observations N.
We need the total N in order to find the expected counts E, just like
how we need the mean x-bar in order to find the sample standard
deviation s in the numerical case.

As
such, once you have the total and O and E for the first C-1
categories, you automatically have them for the last category.
Analogously to the standard deviation situation, this means we have
C-1 degrees of freedom in a one-way chi-squared test.

We have C categories with numbers in them,
but we need to spend 1 DF on finding the total, leaving (C – 1) DF
for estimating uncertainty.
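As a quick sketch, here is a goodness-of-fit calculation in plain Python. The die-roll counts are made up, and `chi2_gof` is an illustrative name. The expected counts are built from the total N, which is the 1 DF we spend.

```python
def chi2_gof(observed, expected_props):
    """Chi-squared goodness-of-fit statistic and its C - 1 degrees of freedom."""
    n = sum(observed)                                  # the total costs 1 DF
    expected = [n * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df

# 60 made-up rolls of a six-sided die, tested against a fair die.
observed = [8, 12, 9, 11, 6, 14]
gof_stat, gof_df = chi2_gof(observed, [1 / 6] * 6)
print(gof_stat, gof_df)
```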

For
the two-way / independence test, we have two categorical variables
with C ‘column’ and R ‘row’ categories, respectively. That
implies that there are C*R combinations of categories. The expected
counts for each combination, or cell, are computed from the R row
totals and the C column totals. There is a bit of redundancy, because
both sets of totals add up to the same N, so that’s actually
R + C – 1 independent pieces of information.

We have C*R cells of information, but for
doing a test of independence, we need R + C - 1 pieces of
information from the row and column totals. That leaves
C*R - (R + C - 1) = C*R - R - C + 1, or (C-1)*(R-1), degrees of
freedom to spend on uncertainty quantification.
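The same accounting can be sketched for a small table. The 2-by-3 table of counts below is made up, and `chi2_independence` is an illustrative name. Every expected count comes from the row and column totals alone.

```python
def chi2_independence(table):
    """Chi-squared independence statistic and (R - 1) * (C - 1) DF."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # built from totals only
            stat += (observed - expected) ** 2 / expected
    n_rows, n_cols = len(row_totals), len(col_totals)
    df = (n_rows - 1) * (n_cols - 1)
    return stat, df

table = [[10, 20, 30],      # made-up counts, R = 2 rows
         [20, 20, 20]]      # C = 3 columns
ind_stat, ind_df = chi2_independence(table)

# Same DF via the 'cells minus independent totals' route.
cells_minus_totals = 2 * 3 - (2 + 3 - 1)
print(ind_stat, ind_df, cells_minus_totals)
```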

Haven't quite had your fill? Here's some theory.

#### Fractional Degrees of Freedom

One particularly
thorny notion about equivalent degrees of freedom is that we can end
up working with a number of degrees of freedom that is not a whole
number. Given that each independent data point yields 1 DF, that's a
little bizarre.

First, when we calculate
equivalent degrees of freedom, we're sometimes using them for
something that is a composite of two or more measures. The contrast
(e.g. difference) between the two means in the two sample t-test, for
example, involves calculating two different standard deviations, so
we're already straying from that idea of 'the amount of information
going into a single estimate'.

Second, that word
INDEPENDENT is a big one. If we have 10 completely independent
observations, then we have a sample of size N=10. But, if those
observations are correlated in some way (e.g. in a time series, like
the day-to-day average temperature), then each new recorded number
isn't giving as much information as a completely independent
observation. In cases like this, we sometimes calculate an 'effective
sample size', which would be somewhere between 1 and N, depending on
how correlated the observations were. That effective sample size
doesn't have to be a whole number, so neither do the degrees of
freedom calculations that are derived from it. (For more on effective
sample size, see pseudoreplication).

Third,
mathematically, there often isn't a problem with using a non-whole
number of degrees of freedom. Both the t-distribution and the
chi-squared distribution work just as well with DF = 3.5 as they do
with DF = 3 or DF = 4.
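To see that nothing breaks, here is a small sketch that evaluates the t-distribution's density directly from its formula, which only needs the gamma function. The helper name `t_pdf` is made up for illustration; the formula itself is the standard Student's t density.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution; df may be any positive real number."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

# The density at 0 rises smoothly as DF goes 3 -> 3.5 -> 4.
for df in (3, 3.5, 4):
    print(df, t_pdf(0.0, df))
```

The value for DF = 3.5 lands between the values for DF = 3 and DF = 4, as you would hope.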

#### Saturation, DF Bankruptcy

If we ever have 0 DF
left over after estimating all the means, slope parameters, or
any other parameters, then we have what's called a saturated
model. In chemistry, a saturated solution is one that is holding all
the dissolved material that it can. A saturated model is one that is
estimating all the parameters that it can. There is nothing left to
measure uncertainty in those estimates.

For a saturated
ANOVA, we can estimate each of the group means, but we have no way of
knowing how good those estimates are. For a saturated regression, we
can get the intercept and the slope, but we have no way of knowing
how uncertain we should be about those estimates.

In a saturated
model, things like confidence intervals, standard errors, and
p-values are impossible to obtain.
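The simplest saturated regression is a line fit to exactly two points. The sketch below (made-up points, illustrative function name `fit_and_residual_variance`) shows that the fit is perfect and the residual variance would be 0 divided by 0, so the function reports it as unavailable.

```python
def fit_and_residual_variance(xs, ys):
    """Fit a line; return (slope, intercept, df_resid, residual variance or None)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    df_resid = n - 2
    # With 0 DF left there is nothing to divide by: no uncertainty estimate.
    resid_var = rss / df_resid if df_resid > 0 else None
    return slope, intercept, df_resid, resid_var

saturated = fit_and_residual_variance([1, 2], [3, 5])          # N = 2
one_df_left = fit_and_residual_variance([1, 2, 3], [3, 5, 6])  # N = 3
print(saturated)
print(one_df_left)
```

Adding a single extra point is enough to get 1 DF back and, with it, a (very rough) estimate of the noise.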

One common solution
to saturation is to impose additional assumptions or restrictions on
the model. In an ANOVA, we might use a fractional factorial model and
not bother to estimate certain high-level interactions. In a
regression, we might treat a set of group effects as random effects,
and not consider them when trying to fit the line of best fit.

In the fractional
factorial case mentioned for ANOVA, this applies to multi-way ANOVAs,
not one-way ANOVAs, and the solution is to simply assume that some
higher-order interactions are zero. If you assume they are zero, you
don't need to estimate them.
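Here is a rough DF count for that idea. It assumes something the text doesn't spell out: every factor has two levels and there is one observation per combination of levels. Under those assumptions each effect costs 1 DF, and the number of interactions of a given order is a binomial coefficient. The function name `leftover_df` is made up for illustration.

```python
from math import comb

def leftover_df(n_factors, max_order, replicates=1):
    """DF left for error in a two-level factorial when interactions above
    max_order are assumed to be zero (assumption: 2 levels per factor)."""
    n_obs = replicates * 2 ** n_factors
    # order 0 is the grand mean, order 1 the main effects, order 2 the
    # two-way interactions, and so on; each term costs 1 DF.
    n_params = sum(comb(n_factors, order) for order in range(max_order + 1))
    return n_obs - n_params

full_model = leftover_df(3, 3)       # estimate everything: saturated
drop_three_way = leftover_df(3, 2)   # assume the three-way interaction is zero
four_factors = leftover_df(4, 2)     # 4 factors, keep up to two-way terms
print(full_model, drop_three_way, four_factors)
```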

The LASSO, a
regression-like method that can handle situations where the number of
possible parameters p is greater than the sample size N, works on a
similar principle: it assumes that most of those possible parameters
are zero, thus saving the degrees of freedom necessary to estimate
them.

#### Mixed-Effects and REML

For the regression
case without random effects, the slopes are traditionally estimated
using a method based on maximum likelihood, or ML. In lay terms, ML
asks: "given the data that we observe, what are the parameter values
that would have the highest chance of producing data like this?"

When we introduce
random effects, REML is used instead, which is short for Restricted
(or Residual) Maximum Likelihood. In this case, we only estimate the
non-random effects (that is, the fixed effects, the ones we actually
care about) using maximum likelihood, and then assign the random
effects as after-the-fact adjustments to our predictions. By not
using the random effects in fitting the model, we don't need to spend
any degrees of freedom to estimate them, and we can save those
degrees of freedom for estimating uncertainty instead. This either
prevents saturation or gives better confidence intervals,
standard errors, and p-values. The trade-off is that we still have no
uncertainty measures for the random effects, but that's an acceptable
issue in many cases.