Wednesday, May 04, 2016

When to use a lognormal distribution rather than a normal distribution

Recently a reviewer commented that perhaps a lognormal distribution rather than normal distribution should be used in order to avoid truncation of negative values. Now I had used lognormal distributions in the past, but only so that I could to describe a positively skewed distribution.

So I did some reading and found a paper by Limpert et al. (2001) from which I picked out a few key points:

  1. The reviewer was quite correct, and unlike a normal distribution, a lognormal distribution is incapable of producing random numbers below zero.
  2. As the mean gets larger and the standard deviation gets smaller, the normal and lognormal distributions become identical.
  3. The two distributions are defined in a similar manner. The normal distribution is defined by the arithmetic mean and standard deviation, and the lognormal distribution is defined by the geometric mean and standard deviation.

To get my head around how the normal and lognormal distributions behaved, and how to generate and fit data in R, I knocked up some example scripts. I didn't fix the random seed value as I wanted to see how the patterns varied with different random samples, so if you run the scripts yourself you will get different results to those shown here, but the patterns will be broadly the same.

# Define some values to work with
v = c(9.36, 8.06, 7.93)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example1.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Normal", xlab="Values")
# Create a probability density values and add to plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, mean = log(gM), sd = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

In this example we can see that the two distributions describe the data in an almost identical manner, so either could be used. However, when the values are closer to zero, and the variation becomes larger, there is a marked difference in behaviour.

# Define some values to work with
v = c(2.49, 0.93, 4.08)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example2.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Normal", xlab="Values")
# Create a probability density values and add to plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, mean = log(gM), sd = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

Now the randomly generated numbers from the normal distribution contain values that are negative, while in contrast the lognormal distribution produces values that are all above zero. This is quite a handy property for modelling, as there are occasions where data such as species abundances or total rainfall cannot be negative. If a normal distribution were used there would be the potential for illogical negative values to be randomly generated – which could cause merry chaos within a model!

So the take home message would seem to be that if your data should always be positive, then use a lognormal distribution rather than a normal distribution. This will prevent any chance of negative values being generated, and as the mean becomes larger and the standard deviation becomes smaller the lognormal distribution becomes identical to a normal distribution anyway.

Of course that is all well and good when you are modelling from your own data. But what if you want to use other published information that has been described in terms of an arithmetic mean and standard deviation rather than the geometric mean and standard deviation that you need for a lognormal distribution? Well, you can also estimate the geometric mean and standard deviation from the arithmetic mean and standard deviation. To demonstrate the following compares a lognormal distribution generated from some actual data and from a reported arithmetic mean and standard deviation.

# Define some values to work with
v = c(2.49, 0.93, 4.08)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example3.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# LOGNORMAL DISTRIBUTION FROM CALCULATED VALUES
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Lognormal calculated", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, mean = log(gM), sd = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION FROM ESTIMATED VALUES
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Estimate the geometric mean and standard deviation
gEstM = aM / sqrt(1 + (aS / aM) ^ 2)
gEstS = exp(sqrt(log(1 + (aS / aM) ^ 2)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gEstM), sdlog = log(gEstS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal estimated", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, mean = log(gEstM), sd = log(gEstS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

As can be seen, the two distributions are very similar. So while it would obviously be better to get the original data and base a lognormal distribution from a calculated geometric mean and standard deviation, if all that is available is a reported arithmetic mean and standard deviation, it should still be possible to produce a reasonable lognormal distribution for inclusion within a model.

References

Limpert E, Stahel WA, Abbt M (2001) Log-normal distributions across the sciences: keys and clues. Bioscience 51: 341-352.