Fitting distributions to data

by David Vose


A common problem in risk analysis is fitting a probability distribution to a set of observations of a variable, so that one can make forecasts about its future values. The most common situation is fitting a distribution to a single variable (like the lifetime of a mechanical or electrical component), but some problems require fitting a multivariate distribution: for example, if one wishes to predict the weight and height of a random person, or the simultaneous change in price of two stocks.

There are a number of software tools on the market that will fit distributions to a data set, and most risk analysis tools incorporate a component that will do this. Unfortunately, the methods they use to measure goodness of fit are wrong, and they are very limited in the types of data they can handle. This paper explains why, and describes a method that is both correct and sufficiently flexible to handle any type of data set.

Fitting a single distribution

The principle behind fitting distributions to data is to find the type of distribution (normal, lognormal, gamma, beta, etc.) and the values of its parameters (mean, variance, etc.) that give the highest probability of producing the observed data. For example, Figure 1 shows the normal distribution with the parameters that best fit a particular data set. The data were randomly generated from a normal distribution with a mean of 4 and a standard deviation of 1. The data set consists of 1026 values, which is many more than one usually has to work with, so the parameter estimates (mean 4.026, standard deviation 1.038) are close to the true values.
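The idea of choosing the parameters that make the observed data most probable can be sketched in a few lines of Python. This is a minimal illustration, not the method or software used to produce Figure 1: the function names, the random seed, and the comparison parameter set are all assumptions made for the example, and because the data are freshly simulated the estimates will not reproduce the 4.026 and 1.038 quoted above. For a normal distribution, the parameters that maximise the joint probability density of the data are the sample mean and the root mean squared deviation from it.

```python
import math
import random


def normal_log_likelihood(data, mu, sigma):
    """Log of the joint probability density of the data under Normal(mu, sigma)."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sum_sq / (2 * sigma ** 2)


def fit_normal(data):
    """Return the (mu, sigma) pair that maximises the normal log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    # The maximum likelihood estimate divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma


# Simulate a data set like the one described: 1026 draws from Normal(4, 1).
# The seed is arbitrary and only makes the run repeatable.
rng = random.Random(1)
data = [rng.gauss(4.0, 1.0) for _ in range(1026)]

mu_hat, sigma_hat = fit_normal(data)
ll_fitted = normal_log_likelihood(data, mu_hat, sigma_hat)

# Any other parameter pair, including the true one, gives a log-likelihood
# that is no higher than the fitted pair on this particular data set.
ll_true = normal_log_likelihood(data, 4.0, 1.0)
ll_shifted = normal_log_likelihood(data, mu_hat + 0.2, sigma_hat * 1.1)

print(f"fitted mean = {mu_hat:.3f}, fitted sd = {sigma_hat:.3f}")
print(f"log-likelihood: fitted = {ll_fitted:.2f}, "
      f"true parameters = {ll_true:.2f}, shifted = {ll_shifted:.2f}")
```

With a sample this large, the fitted values should land close to 4 and 1. The same comparison of log-likelihoods can in principle be made between different types of distribution as well as between parameter values.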

Usually, of course, we do not know that the data came from any specific type of distribution, though we can often guess at some good candidates by matching the nature of the variable to the theory on which the probability distributions are based. The normal distribution, for example, is a good candidate if the random variation of the variable under consideration is driven by a large number of random factors (none of which dominates) acting in an additive fashion, whereas the lognormal is a good candidate if a large number of factors influence the value of the variable in a multiplicative way.
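The additive versus multiplicative distinction can be checked with a small simulation. The sketch below is illustrative only: the number of factors, their uniform range of 0.8 to 1.2, the sample size, and the use of sample skewness as a rough symmetry check are all assumptions chosen for the example. A variable built by adding many small factors should look roughly symmetric, as a normal distribution does. A variable built by multiplying them should be right-skewed, while its logarithm should be roughly symmetric again, which is the defining property of a lognormal variable.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(7)  # arbitrary seed, for repeatability only
n_factors = 50          # many factors, none of which dominates
n_samples = 5000


def draw_factors():
    """One set of independent random factors scattered around 1."""
    return [rng.uniform(0.8, 1.2) for _ in range(n_factors)]


# Factors combined additively: the result should be approximately normal.
additive = [sum(draw_factors()) for _ in range(n_samples)]

# Factors combined multiplicatively: the result should be approximately lognormal.
multiplicative = [math.prod(draw_factors()) for _ in range(n_samples)]

# The log of a product is a sum of logs, so this should be approximately normal.
log_multiplicative = [math.log(v) for v in multiplicative]

print(f"skewness of additive variable:        {skewness(additive):.3f}")
print(f"skewness of multiplicative variable:  {skewness(multiplicative):.3f}")
print(f"skewness of log(multiplicative):      {skewness(log_multiplicative):.3f}")
```

Skewness near zero is consistent with a normal shape but does not prove it, so a check like this only suggests candidate distributions; it does not replace fitting them to the data.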

A number of other graphs can help you visualise how well the distribution matches the data…

This is just an excerpt from a full white paper
