Section 1: Introduction – The Problem of Binned Data
Hypothetically, say you’re given data like this in Table 1 below, and you’re asked to find the mean:
Group

Frequency

0
to 25

114

25
to 50

76

50
to 75

58

75
to 100

51

100
to 250

140

250
to 500

107

500
to 1000

77

1000
to 5000

124

5000
or more

42

Table 1: Example Binned Data.
Border cases go to the lower bin.
The immediate problem is that the mean (and the variance, and many other statistics) is the average of exact values by we have ranges of values. There are a few things similar to getting the mean that could be done:
1. Take the median instead of the mean, which is somewhere in the ‘100 to 250’ bin, but higher than most of the values in that bin. The answer is still a bin instead of a number, but you have something to work with.
2. Assign each bin a number such a '0 to 25' response would be 1, a '25 to 50' response would be 2, and so on to 9. One could take the mean of the bin numbers and obtain an 'average' bin, in this case 4.93. This number doesn't have clear translation to the values inside the bins.
2. Assign each bin a number such a '0 to 25' response would be 1, a '25 to 50' response would be 2, and so on to 9. One could take the mean of the bin numbers and obtain an 'average' bin, in this case 4.93. This number doesn't have clear translation to the values inside the bins.
3. Impute each binned value to the bin's midpoint and take the mean. Here, a '0 to 25' response would be 12.5, a '25 to 50' response would be 37.5, ... , a '1000 to 5000' response would be 3000. This poses two problems right away: First, unbounded bins like '5000 or more' have ambiguous midpoints. The calculated mean is sensitive to the choice of imputed upper bound. Taking the upper bound literally as infinity yields an infinite mean estimate. The second problem is that midpoints are not realistic. Consider that the size of the bins is increase approximately exponentially but that the frequencies are not. That implies that as a whole, smaller values are more common than larger values within the population; it’s reasonable to assume that this trend would be true within bins as well.
Better than all of these solutions is to fit a continuous distribution to the data and derive our estimates of the mean (and any other parameter estimates) from the fitted distribution
Section 2: Fitting A Distribution
We can select a distribution, such as an exponential, gamma, or lognormal, and find the probability mass that lands in each bin for that distribution. For example, if we were to fit an exponential with rate = 1/1000 (or mean = 1000), then
( 1  exp(.025)) of the mass would be in the '0 to 25' bin,
(exp(.025)  exp(.050)) would be in the '25 to 50' bin,
(exp(1)  exp(5)) would be in the '1000 to 5000' bin, and
exp(5) would be in that final '5000 or more' bin. No arbitrary endpoint needed.
We can then choose a parameter or set of parameters that minimizes some distance criterion between these probability weights and the observed relative frequencies. Possible criteria are the KullbackLeibler (KL) divergence, the negative loglikelihood, or the chisquared statistic. Here I'll use the KL divergence for demonstration.
The KullbackLeibler divergence gives a weighted difference between two discrete distributions. It is calculated as \sum_i{ p_i * log(q_i / p_i) }, where p_i and q_i are probabilities of distributions p and q at value i, respectively. (There is also a continuous distribution version of the KL divergence, which is calculated similarly, as shown in https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence )
( 1  exp(.025)) of the mass would be in the '0 to 25' bin,
(exp(.025)  exp(.050)) would be in the '25 to 50' bin,
(exp(1)  exp(5)) would be in the '1000 to 5000' bin, and
exp(5) would be in that final '5000 or more' bin. No arbitrary endpoint needed.
We can then choose a parameter or set of parameters that minimizes some distance criterion between these probability weights and the observed relative frequencies. Possible criteria are the KullbackLeibler (KL) divergence, the negative loglikelihood, or the chisquared statistic. Here I'll use the KL divergence for demonstration.
get.KL.div.exp = function(x, cutoffs, group_prob)
{
exp_prob= diff( pexp(cutoffs, rate = 1/x))
KL = sum(group_prob * log(group_prob / exp_prob))
return(KL)
}
result = optim(par = init_par_exp, fn = get.KL.div.exp, method = "Brent", lower=1, upper=10000, cutoffs=cutoffs, group_prob=group_prob)
Here, 'cutoffs' is a vector/array of the boundaries of the bins. For the largest bin, an arbitrarily large number was used (e.g. the boundary between the largest and secondlargest bins times 1000). The vector/array group_prob is the set of observed relative frequencies, and exp_prob is the set of probabilities of the exponential distribution that fall into each bin. The input value x is the rate parameter, and the returned value KL is the KullbackLeibler divergence. The input value init_par_exp is the initial guess at the rate parameter, and it can be almost anything without disrupting optim().
In the example case of Table 1, the optim function returns an estimate of 1/743 for the rate parameter (or 743 for the 'mean' parameter), which translates to an estimate of 743 for the mean of the binned data. The fitted distribution had a KLdivergence of 0.4275 from the observed relative frequencies.
For comparison, the midpointbased estimate of the mean was 1307, using an arbitrary largestvalue bin of '5000 to 20000'.
Section 3: Selecting a distribution
We don’t yet know if the exponential distribution is the most appropriate for this data. Other viable possibilities include the gamma and the lognormal distributions.
We can fit those distributions in R with the same general approach to minimizing the KL divergence using the following code snippets. For the lognormal, we use
get.KL.div.lognorm = function(x, cutoffs, group_prob)
{
exp_freq = diff( plnorm(cutoffs, meanlog=x[1], sdlog=x[2]))
KL = sum(group_prob * log(group_prob / exp_freq))
return(KL)
}
result_lnorm = optim(par = c(init_mean_lognorm, init_sd_lognorm), fn = get.KL.div.lognorm,
method = "NelderMead", cutoffs=cutoffs, group_prob=group_prob)
Where x is the vector of the two parameters, mu and sigma, for the lognormal, and where 'init_mean_lognorm' and 'init_sd_lognorm' are the initial parameter estimates. We derived initial estimates with the method of moments and the bin midpoints. Also note that the optimization method has changed from the single variable ‘Brent’ method to the multiple variable ‘NelderMead’ method. The respective code snippet for the gamma distribution follows.
get.KL.div.gamma = function(x, cutoffs, group_prob)
{
exp_freq = diff( pgamma(cutoffs, shape=x[1], scale=x[2]))
KL = sum(group_prob * log(group_prob / exp_freq))
return(KL)
}
result_gamma = optim(par = c(init_shape_gamma, init_scale_gamma), fn = get.KL.div.gamma, method = "NelderMead", cutoffs=cutoffs, group_prob=group_prob)
Here, x is the vector of the shape and scale parameters, respectively. As in the lognormal case, the initial parameter values are estimated using the method of moments derived from the midpointofbin case.
The results of fitting all three distributions are shown in Table 2 below.
Distribution

Initial Parameters

Final Parameters

KLDivergence

Est. Of Mean

Exponential

Mean = 1307

Mean = 743

0.4275

743

Gamma

Shape = 0.211
Scale = 6196

Shape = 0.378
Scale = 2317

0.0818

876

LogNormal

Mean = 7.175
SD = 7.954

Mean = 5.263
SD = 1.932

0.0134

1241

The choice of distribution matters a lot. We get completely different estimates for each distribution. In this case, the low KL divergence of the lognormal distribution makes a compelling case for the lognormal to be the distribution we ultimately use.
Figure 1 further shows the aptitude of the lognormal for this data. Note the log scale of the x.
Figure 1 further shows the aptitude of the lognormal for this data. Note the log scale of the x.
Figure 1 
Section 4: Uncertainty estimates.
We have point estimates of the population parameters, but nothing yet to describe their uncertainty. There are a few sources of variance we could consider, specifically:
 The uncertainty of the parameters given the observations,
 The sampling distribution that produced these observations.
Methods exist for finding confidence regions for sets of parameters like the gamma. (See: https://www.tandfonline.com/doi/abs/10.1080/02331880500309993?journalCode=gsta20 ) , from which we could derive confidence intervals for the mean, but at this exploratory stage of developing the ‘debinning’ method, I’m going to ignore it.
The uncertainty from the sampling distribution can be addressed with a Monte Carlo approach. For each of many iterations, generate a new sample by putting observations in bins with probability proportional to the observed frequencies. Then using the debinning method described in the previous sections to estimate parameters of the fitted distribution and the sample mean. Over many iterations, the parameter estimates will form their own distribution for which you can draw approximate confidence intervals and regions.
For the lognormal distribution applied to our example case in Table 1, we find the results shown in Table 3 and Figures 2 and 3, all after 3000 resamplings of the observed data with replacement.
We get some idea of the scale of the uncertainty. Since we haven't incorporated all the variation sources, these confidence intervals are optimistic, but probably not overly so.
Section 5: Discussion – The ToDo list.
 Examine more distributions. There are plenty of other viable candidate distributions, and more can be made viable with transformations such as the logarithmic transform. There are often overlooked features of established distributions, such as their noncentrality parameters, that open up new possibilities as well. There are established distributions with four parameters like the generalized beta that could also be applied, provided they can be fit and optimized with computational efficiency.
 Address Overfitting. This wide array of possible distributions brings about another problem: overfitting. The degrees of freedom available for fitting a distribution is only the number of boundaries between bins, or B – 1 if there are B bins. The example given here is an ideal one in which B=9. In many other situations, there are as few as 5 bins. In these cases, a fourparameter distribution should be able to fit the binned data perfectly, regardless of the distribution’s appropriateness in describing the underlying data.
 Estimation of values within categories. If we have a viable continuous distribution, then we have some information about the values within a bin. Not their exact values, but a better idea than simply “it was between x and y”. For starts, we can impute the conditional mean within each category with a bounded integral, and right away this is a better estimate to use than the midpoint when using exact values.
 Incorporation of covariates. We can go further still in estimating these individual values with bins by looking at their predicted bin probabilities, in turn directed from ordinal logistic regression. For example, an observation that is predicted to be in a higher bin than it actually is, is in turn likely to have a true value close to the top of that bin.
In short, there’s a golden hoard worth of information available if only we are able to slay the beast of binning* to get to it.
* Feel free to use ‘dragon of discretization’ in place of this.
Appendix: Full R Code Available Here 
No comments:
Post a Comment