Does this look normal to you? (Q-Q plots)


One question that always seems to come up in introductory statistics classes is how to tell if a sample is “normal enough” from a normal quantile (or Q-Q) plot. Students are told that the points on the plot should form a more or less straight line, and that they should be especially wary of bow shapes and S shapes, which signify departures from normality in the forms of skew and tail weight. When there are plenty of sample points this often isn’t a problem, but when there are relatively few (say, around 30, often cited as a rough minimum for making normality assumptions) there will usually be noticeable deviations from the diagonal, even when the underlying population really is normal. So, how can we help students use Q-Q plots before their intuition has been properly built up?
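To get a feel for how much a Q-Q plot can wander when the population really is normal, it may help to draw a few normal samples of size 30 and plot them side by side. The sketch below does that; the seed and the name `samples` are arbitrary choices made for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the picture is reproducible

# Four independent standard normal samples of size 30, one per column
samples <- replicate(4, rnorm(30))

# Plot them in a 2 x 2 grid, then restore the previous layout
old.par <- par(mfrow = c(2, 2))
for (j in 1:4) {
  qqnorm(samples[, j])
  qqline(samples[, j])
}
par(old.par)
```

Rerunning this with different seeds gives a sense of the bends and stray tail points that show up by chance alone.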

[Figure: Can you tell which plot does not come from a normal sample? Answer: the bottom right is from a gamma(4, 4) distribution.]

We could compare the sample’s Q-Q plot with Q-Q plots of samples of the same size drawn from a normal population whose mean and standard deviation equal those of the sample. Perhaps we can do one better, though, by checking whether we can pick out the sample under study from among such normal samples.
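The first idea, a direct comparison, might be sketched as follows: overlay the sample's Q-Q points on the Q-Q traces of several simulated normal samples with the same size, mean, and standard deviation. The function `qqcompare()` and its `nsim` argument are hypothetical names introduced here for illustration, and the choice of 20 simulated samples is arbitrary.

```r
# Hypothetical helper: overlay the sample on simulated normal Q-Q traces
qqcompare <- function(x, nsim = 20) {
  n <- length(x)
  # Theoretical normal quantiles, the same plotting positions qqnorm() uses
  theo <- qnorm(ppoints(n))
  # Each column is one sorted normal sample matching the mean and sd of x
  sims <- replicate(nsim, sort(rnorm(n, mean(x), sd(x))))
  # Empty plot sized to hold both the real and the simulated samples
  plot(theo, sort(x), type = "n", ylim = range(sims, x),
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
  # Simulated samples as grey traces, the real sample as solid points on top
  for (j in seq_len(nsim)) lines(theo, sims[, j], col = "grey")
  points(theo, sort(x), pch = 19)
  invisible(sims)
}

# Example with a skewed sample of size 30
sims <- qqcompare(rgamma(30, 4, 4))
```

If the solid points stay inside the band of grey traces, the sample is behaving about as a normal sample of that size would.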

Here’s a function, qqfind(), that randomly hides the sample Q-Q plot among three Q-Q plots of samples from the hypothetical normal population and returns the position of the real one. If we’re unable to clearly identify the real sample, we could say that it’s “normal enough”.

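qqfind <- function(x) {
  # Parameters of the hypothetical normal population, estimated from the sample
  m <- mean(x)
  s <- sd(x)
  # Use a 2 x 2 grid of plots and restore the previous layout on exit
  old.par <- par(mfrow = c(2, 2))
  on.exit(par(old.par))
  # Choose the panel that will hold the real sample
  real <- sample(1:4, 1)
  for (i in 1:4) {
    # Real sample in the chosen panel, a simulated normal sample elsewhere
    y <- if (i == real) x else rnorm(length(x), m, s)
    qqnorm(y)
    qqline(y)
  }
  # Position of the real sample (1 = top left, 4 = bottom right)
  return(real)
}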

The following code generates the figure above and the two below.

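set.seed(1)  # make the figures reproducible

# Three non-normal samples of size 30
data.gamma <- rgamma(30, 4, 4)     # gamma with shape 4 and rate 4
data.binom <- rbinom(30, 30, 0.3)  # binomial with 30 trials and p = 0.3
data.cauchy <- rcauchy(30)         # standard Cauchy

# Each call draws one 2 x 2 figure and returns the panel of the real sample
real.gamma <- qqfind(data.gamma)
real.binom <- qqfind(data.binom)
real.cauchy <- qqfind(data.cauchy)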

[Figure: The binomial sample stands out because of the staircase pattern, an artifact of the discrete sample space.]

[Figure: The heavy tails of the Cauchy sample are quite evident.]