One of my recent projects used a simple Bayesian version of the linear model.
While the function I wrote to fit the model certainly worked, I was interested in making it more robust and easier to use by imitating the form of R’s built-in
lm() function, which handles the frequentist linear model.
In this post I’ll take apart the existing
lm() function, describe a simple Bayesian linear model with conjugate priors, and build a Bayesian counterpart.
Along the way I’ll go through some properties of the R language that don’t come up in everyday data analytical use, but can be immensely helpful when writing your own methods.
Specifically, I’ll explain how to take a full data frame and use a model formula and a subsetting criterion to produce a response vector and a design matrix that can be used directly for model fitting.
The source code of any function in R can be viewed by entering the name of the function into the console.
Sometimes the source code isn’t terribly helpful, especially if it just sends you to a C call, but we get lucky in the case of
lm(), which provides a good explanation of what’s going on.
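For example, typing lm at the console (without parentheses) prints its definition, and args() and body() let us inspect the pieces separately:

```r
## Print the full definition of lm() by entering its name without parentheses
lm

## Inspect the formal arguments and the body separately
args(lm)
body(lm)
```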
If you look closely, it becomes apparent that
lm() doesn’t actually perform any of the actual model fitting, but is really a wrapper function that converts the arguments into a friendly format for
lm.fit(), which is called on line 45.
This is okay with us, because the method for fitting the Bayesian model will be entirely different from the method for fitting the frequentist model.
We will probably have a
blm.fit() function that does all the fitting work, given data in a friendly format, and the user-facing function
blm() that makes the whole process easy for the user.
The blm() function performs many of the same functions as
lm() and thus could look fairly similar.
First, we’ll take a look at the list of arguments (formula, data, subset, etc.).
A description of the purpose of each argument is given in the help file, which can be accessed by typing ?lm.
The arguments we’ll be focusing on are the following:
formula, which specifies a model structure, e.g.
Ozone ~ Wind + Temp + Wind:Temp,
data, a data frame containing the observations,
subset, which indicates which observations from data should be included in the fit.
The remaining arguments, such as weights and offset, refer to more complicated ways of running a model, and are outside the scope of this post.
The rest of the arguments primarily tell the function what sorts of additional information to return at the end, and will be dealt with later.
Constructing model matrices
The lm.fit() function, like our soon-to-be
blm.fit() function, takes as arguments a response vector
y and a design matrix x.
This is the mathematically convenient starting point for a model fit, but can be cumbersome to a user.
Part of the job of
lm() is to accept the complete data frame, take only a specified subset of observations (rows), and then use the given model formula to produce a response vector and a design matrix for lm.fit().
Much of the actual work toward this is done by the
model.frame() function, which takes a formula, data frame, subsetting criterion, and instructions on what to do with unused levels of factors, and returns a data frame containing the variables used in the formula along with information on how to extract the response vector and design matrix.
In lm(), a good chunk of code takes the arguments passed in the original function call and builds them into a call to
model.frame() which is then evaluated within the environment from which
lm() was called.
I’ve reproduced the corresponding section of code and commented each step.
There’s some more logic for handling edge cases in
lm() before extracting the response vector and design matrix, but I’ve pulled out the parts I’m interested in and commented them.
It turns out that once we have the model frame, extracting these objects is quite easy.
If we’re not dealing with contrasts, we can drop the final argument from the model.matrix() call.
We can now pass these objects to
lm.fit() as the design matrix and response vector, respectively.
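To make this concrete, here is a minimal sketch of the whole formula-to-matrices pipeline, using the airquality data that appears in the test run later. Note that model.frame() drops rows with missing values by default:

```r
## Build the model frame from a formula, a data frame, and a subset criterion
mf <- model.frame(Ozone ~ Wind + Temp + Wind:Temp,
                  data = airquality,
                  subset = Temp > 60)

## Extract the response vector and design matrix from the model frame
y <- model.response(mf, "numeric")
x <- model.matrix(attr(mf, "terms"), mf)

colnames(x)  # "(Intercept)" "Wind" "Temp" "Wind:Temp"
```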
The lm() function returns a lot of information encapsulated in an object of class "lm".
The minimum attributes required by the specification of the
"lm" class are the following:
coefficients, a named vector of coefficients,
residuals, the residuals, that is response minus fitted values,
fitted.values, the fitted mean values,
rank, the numeric rank of the fitted linear model,
weights, (only for weighted fits) the specified weights,
df.residual, the residual degrees of freedom,
several more items that simply regurgitate information passed to the function or used internally.
In fact, items 1–6 are returned by
lm.fit() and are simply passed along.
The values in item 7 include things like the original function call, response vector, and design matrix, the latter two returned only if the y and x arguments are set to TRUE in the function call, respectively.
The following excerpt from
lm() creates a list
z of the above items and assigns it the
"lm" class designation, then returns it.
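The same pattern will work for our function: in R's S3 system, assigning a class attribute to a list is all that's needed for method dispatch. A toy illustration (z here is a stand-in, not a real fitted model):

```r
## Assigning a class attribute to a list is all S3 needs for dispatch
z <- list(coefficients = c("(Intercept)" = 1), residuals = numeric(0))
class(z) <- "blm"
inherits(z, "blm")  # TRUE
```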
A simple Bayesian linear model with conjugate priors
Supposing we have a response vector Y and a design matrix X, we can specify the Bayesian hierarchical linear model

Y | β, σ² ~ N(Xβ, σ²Σ),
β | σ² ~ N(ν, σ²R),
σ² ~ Inverse-Gamma(a₀, b₀),

in which β is a vector of regression parameters, σ² is the variance parameter, and Σ, ν, R, a₀, and b₀ are hyperparameters assumed known. In the usual uncorrelated data case, Σ is the identity matrix. For vague priors, we can set ν to be the zero vector, R to be a diagonal matrix with very large diagonal entries, and a₀ = b₀ very small.
Given the observed response vector y, the joint posterior distribution is

β | σ², y ~ N(μₙ, σ²Λₙ⁻¹),
σ² | y ~ Inverse-Gamma(aₙ, bₙ),

where Λₙ = XᵀΣ⁻¹X + R⁻¹, μₙ = Λₙ⁻¹(XᵀΣ⁻¹y + R⁻¹ν), aₙ = a₀ + n/2, and bₙ = b₀ + (yᵀΣ⁻¹y + νᵀR⁻¹ν − μₙᵀΛₙμₙ)/2.
The marginal posterior distribution of β is the multivariate Student’s t distribution with 2aₙ degrees of freedom, location μₙ, and scale matrix (bₙ/aₙ)Λₙ⁻¹.
All the information we need to make inferences is contained in these posteriors.
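Sketched in R under the assumption Σ = I (the names conjugate.update, R.inv, and so on are mine, not the actual internals of blm.fit()), the conjugate updates take only a few lines:

```r
## Sketch of the conjugate posterior updates with Sigma = identity.
## nu, R.inv, a0, b0 are the prior hyperparameters.
conjugate.update <- function(y, x, nu, R.inv, a0, b0) {
  n <- nrow(x)
  ## Posterior precision and mean of beta (given sigma^2)
  post.precision <- crossprod(x) + R.inv              # t(x) %*% x + R^-1
  post.mean <- solve(post.precision, crossprod(x, y) + R.inv %*% nu)
  ## Posterior inverse-gamma parameters for sigma^2
  a.n <- a0 + n / 2
  b.n <- b0 + (sum(y ^ 2) + sum(nu * (R.inv %*% nu)) -
               sum(post.mean * (post.precision %*% post.mean))) / 2
  list(mean = drop(post.mean), precision = post.precision,
       a = a.n, b = drop(b.n))
}
```

With a very diffuse prior (tiny R.inv, a0, b0), the posterior mean of β collapses to the ordinary least squares estimate, which is why the test run at the end of this post matches lm() so closely.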
The first order of business in creating the
blm.fit() function is to determine what sort of input we want to take and what should be returned given that input.
In order to maintain the user-friendliness of
lm(), we want to take the three basic arguments: a model formula, an optional full data frame, and an optional subset vector that can alternatively be specified as an inclusion criterion.
In order to fit the Bayesian model, we also need the hyperparameters.
Thankfully we can construct some default values which lead to sensible vague priors most of the time.
This will allow users to call
blm() just as they would call
lm(), ignoring prior specification.
However, it is important to allow the manual specification of priors through setting hyperparameters, and we must think carefully about how those should be passed as arguments.
We want the hyperparameter arguments to be clearly named and flexible in use, allowing each argument to be included or excluded separately, and for variances to be specified both as diagonals (for independence models) and as full matrices (for correlated models).
I choose the following arguments and default values:
prior.mean = 0, a scalar or vector, corresponding to ν, which determines the prior mean of the regression coefficients (if a scalar, treat as a vector with all elements equal to that scalar),
prior.precision = 0.0001, a scalar, vector, or matrix, corresponding to R⁻¹, which determines the prior precision of the regression coefficients (if a vector, R⁻¹ is diagonal with the given vector as the diagonal; if a scalar, treat as a vector with all elements equal to that scalar),
prior.df = 0.0001, a scalar, corresponding to 2a₀, which determines the prior degrees of freedom (and can be thought of as the number of “prior observations”),
prior.scale = 1, a scalar, corresponding to sqrt(b₀ / a₀), the square root of the harmonic mean of the prior mode and prior mean (when the latter exists) of σ²,
cov.structure = 1, a scalar, vector, or matrix, corresponding to Σ, which determines the conditional covariance structure (but not magnitude) of the responses (if a vector, Σ is diagonal with the given vector as the diagonal; if a scalar, treat as a vector with all elements equal to that scalar).
As a courtesy to those who wish to specify the prior on σ² directly, we’ll include optional arguments prior.a and prior.b, which can be used as alternatives to prior.df and prior.scale, as long as arguments from only one pair are used. If prior.df and prior.scale are specified, we compute prior.a <- prior.df / 2 and prior.b <- (prior.df * prior.scale ^ 2) / 2.
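In code, the conversion is just the following, and we can sanity-check it against the definition of prior.scale as sqrt(b₀ / a₀):

```r
## Convert (prior.df, prior.scale) to the inverse-gamma parameters (a0, b0)
prior.df <- 0.0001
prior.scale <- 1

prior.a <- prior.df / 2
prior.b <- (prior.df * prior.scale ^ 2) / 2

sqrt(prior.b / prior.a)  # recovers prior.scale
```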
The prior coefficient precision matrix is used instead of the prior coefficient covariance matrix because it’s the traditional Bayesian parameterization of normal priors.
prior.df and prior.scale are the defaults because they have more intuitive interpretations than prior.a and prior.b.
Finally, we have the placeholders p and n in the default values.
These are, respectively, the number of regression parameters, which is equal to the number of columns of the design matrix, and the number of observations, which is equal to the number of rows.
More on prior specification
Things can get a little tricky with the prior specifications because of the way terms in model formulae are translated into the design matrix.
Roughly, the columns of the design matrix correspond to the terms of the model formula in order of appearance.
The nuances most commonly affecting this heuristic are the fact that the intercept term implicitly comes first, and that interactions specified as
a * b are translated to
a + b + a:b.
However, I don’t know of a better way to specify the priors.
model.matrix() converts categorical covariates into a set of indicator covariates with the first level treated as the baseline.
Levels are taken in the order returned by levels().
The specification of the prior must account for this.
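A quick example shows how a three-level factor becomes two indicator columns, with the first level absorbed into the intercept (d and g here are throwaway names):

```r
## A three-level factor expands to two indicator columns; level "a"
## (the first) is the baseline absorbed into the intercept
d <- data.frame(y = 1:6, g = factor(c("a", "b", "c", "a", "b", "c")))
colnames(model.matrix(y ~ g, d))  # "(Intercept)" "gb" "gc"
```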
All this can be avoided in many use cases by passing the arguments as scalars, as they’ll automatically be expanded to the correct forms. However, care must be taken when specifying more complicated priors.
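One way to implement that automatic expansion is sketched below; expand.precision is a hypothetical helper, not part of the code described above:

```r
## Hypothetical helper: expand a scalar, vector, or matrix precision
## argument to a full p x p matrix
expand.precision <- function(prec, p) {
  if (is.matrix(prec)) return(prec)
  diag(rep(prec, length.out = p), nrow = p)
}

expand.precision(0.0001, 3)      # 3 x 3 diagonal matrix, 1e-4 on the diagonal
expand.precision(c(1, 2, 3), 3)  # diagonal matrix with entries 1, 2, 3
```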
The return value of
blm() will use the return value of
lm() as a starting point.
We’ll return an object of class "blm". An object of class "blm" contains the following components:
coefficients, a named vector of coefficients,
residuals, the residuals,
fitted.values, the fitted mean values,
prior.params, a list containing the prior hyperparameters passed as arguments, expanded to full form where applicable,
posterior.params, a list containing the posterior hyperparameters,
cov.structure, the conditional covariance structure passed as an argument, expanded to full form,
call, the matched call,
model, the model frame used.
The values of prior.params and cov.structure are produced within the body of blm() during the handling of passed arguments, and posterior.params will be returned by blm.fit() along with the model coefficients, residuals, fitted values, etc.
When a model is fit with fit <- lm(), the result is often then either printed immediately by typing fit or summarized using summary(fit).
There are other methods that are often used, for example
vcov(), but once the basics of creating new methods in R are learned, creating these particular methods for class
"blm" is straightforward.
When a user types
fit, the function that’s actually called is
print(fit), which is dispatched to
print.lm(fit). We then need to define the function print.blm().
The summary() function works similarly. For now we’ll just print out the function call, a coefficient table containing posterior means and (marginal) standard errors, an estimate of the variance parameter, and the degrees of freedom.
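A minimal sketch of such a method, assuming only the call and coefficients components listed above, could look like this:

```r
## Minimal S3 print method for class "blm"; R dispatches print(fit)
## here whenever fit has class "blm"
print.blm <- function(x, ...) {
  cat("Call:\n")
  print(x$call)
  cat("\nCoefficients (posterior means):\n")
  print(x$coefficients)
  invisible(x)
}
```

Returning the object invisibly is the convention for print methods, so that typing fit shows the output without re-printing the return value.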
A test run
Now that we have everything we need, let’s run a basic test of our code.
> summary(fit.b)

Call:
blm(formula = Ozone ~ Wind + Temp + Wind:Temp, data = airquality)

Coefficients:
              Post. Mean  Marg. Post. SE
(Intercept) -248.3768291     47.70604708
Wind          14.3237294      4.20065758
Temp           4.0740781      0.58223882
Wind:Temp     -0.2237744      0.05350456

Variance: 403.3925

Residual degrees of freedom: 112.0001

> summary(fit.f)

Call:
lm(formula = Ozone ~ Wind + Temp + Wind:Temp, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max
-39.906 -13.048  -2.263   8.726  99.306

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -248.51530   48.14038  -5.162 1.07e-06 ***
Wind          14.33503    4.23874   3.382 0.000992 ***
Temp           4.07575    0.58754   6.937 2.73e-10 ***
Wind:Temp     -0.22391    0.05399  -4.147 6.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.44 on 112 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.6261,    Adjusted R-squared:  0.6161
F-statistic: 62.52 on 3 and 112 DF,  p-value: < 2.2e-16
Everything matches up! There’s certainly some more work to be done as far as handling edge cases and making the output cleaner and more informative, but this constitutes a solid foundation on which to build.