Generalized linear models (GLMs) are a large class of models that includes the familiar ordinary linear regression, also known as ordinary least squares (OLS) regression, and analysis of variance (ANOVA) models.
A bag loaded with tricks (models, rather)
Both OLS regression and ANOVA deal with continuous response variables. However, there are times when we need to predict a categorical or discrete response variable, for example, yes/no responses and count data.
For this purpose, other models such as logit, probit, and log-linear models, to name a few, are appropriate.
Yes, there are a lot of models in the GLM domain. It is easy to get confused about which type of statistical model is suitable for the data you have on hand.
Here, we will dispel this confusion by getting to know GLMs a bit more.
What makes up a GLM?
There are three ingredients that make up a GLM:
- Random component
- Linear predictor
- Link function
By knowing these three components, we can be guided on what type of model we should be using for our data.
Let’s look at them one by one.
Component 1: Random component
The random component pertains to the response variable we are trying to model. Let’s call this variable Y.
This variable Y is assumed to follow a particular probability distribution. Here are some examples.
Response Variable (Y) | (Usually assumed) Distribution |
Number of successes in a given number of trials | Binomial |
Counts | Poisson, Negative Binomial |
Continuous observation (e.g., weight) | Normal, Gamma |
Categorical data have a nominal or ordinal scale of measurement. Interval and ratio data are both continuous.
If you are having trouble recognizing whether a variable is categorical or continuous, this explainer on the levels of measurement might help.
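To make the distributions in the table above more concrete, here is a minimal sketch that draws a few values from a binomial, a Poisson, and a normal distribution using only Python's standard library. The helper names (draw_binomial, draw_poisson, draw_normal) and all parameter values are assumptions made up for illustration; they do not come from any particular package.

```python
import math
import random

rng = random.Random(0)  # fixed seed so repeated runs give the same draws

def draw_binomial(n_trials, p):
    """Number of successes in n_trials independent yes/no trials."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)

def draw_poisson(rate):
    """A count drawn with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def draw_normal(mean, sd):
    """A continuous observation, e.g. a weight in kilograms."""
    return rng.gauss(mean, sd)

# Hypothetical parameters, chosen only for illustration
successes = [draw_binomial(10, 0.3) for _ in range(5)]  # integers from 0 to 10
counts = [draw_poisson(4.0) for _ in range(5)]          # non-negative integers
weights = [draw_normal(70.0, 8.0) for _ in range(5)]    # real numbers

print("Binomial:", successes)
print("Poisson: ", counts)
print("Normal:  ", weights)
```

Notice how the kind of value each draw produces (a bounded count, an unbounded count, a real number) matches the kind of response variable listed in the table.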
Component 2: Linear predictor
The linear predictor in a GLM will specify the explanatory variables, also known as predictors. It follows the form:
α + β1x1 + β2x2 + … + βpxp
The x’s in the equation are the values of the predictors that you have specified.
For example, you might be interested in predicting the tendency of a person to vote or not to vote for a presidential candidate.
The response variable is then a yes/no variable, depending on whether a person will vote or not.
What can be potential predictors? They could be the person’s party of choice, economic status, or level of education, just to name a few.
Note that the equation above is a linear equation. This is the “linear” in “generalized linear models”. It pertains to how the predictors enter the model in a linear fashion.
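As a rough sketch, the linear predictor can be computed in a few lines of Python. The function name linear_predictor, the coefficient values, and the way the voter's predictors are coded are all hypothetical; in practice the coefficients would be estimated from data.

```python
# Minimal sketch of a linear predictor: alpha + beta_1*x_1 + ... + beta_p*x_p.
# All coefficient and predictor values below are made up for illustration.

def linear_predictor(alpha, betas, xs):
    """Return alpha + sum(beta_j * x_j) for one observation."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    return alpha + sum(b * x for b, x in zip(betas, xs))

# Hypothetical voter: x1 = years of education, x2 = 1 if a party member, else 0
alpha = -1.0
betas = [0.25, 0.5]
voter = [12, 1]

eta = linear_predictor(alpha, betas, voter)
print(eta)  # -1.0 + 0.25*12 + 0.5*1 = 2.5
```

Each predictor contributes its value times its coefficient, and the contributions are simply added up. That additive structure is what "linear" refers to.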
Component 3: Link function
Now note that whenever we predict the response variable, we are predicting its mean or average value.
The link function is simply some function involving the mean response. Let’s denote this mean as μ.
Here are some common functions of μ used as link functions along with their names:
Function | Link Type |
μ | Identity link |
log(μ) | Log link |
log[μ/(1-μ)], also known as logit(μ) | Logistic or logit link |
To complete the GLM, we equate the link function to the linear predictor.
For instance, a GLM using the log link with two predictor variables will look like this:
log(μ) = α + β1x1 + β2x2
The GLM above is an example of a log-linear model.
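The sketch below shows, under made-up coefficient values, how the three link functions from the table behave and how inverting the link turns a linear predictor back into a mean response. The function names are hypothetical and only the standard math module is used.

```python
import math

# The three link functions from the table, each applied to the mean response mu.
def identity_link(mu):
    return mu

def log_link(mu):
    return math.log(mu)  # requires mu > 0

def logit_link(mu):
    return math.log(mu / (1 - mu))  # requires 0 < mu < 1

# Inverse links map a linear predictor value (eta) back to a mean.
def inverse_log(eta):
    return math.exp(eta)

def inverse_logit(eta):
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients and predictor values for a two-predictor model
alpha, beta1, beta2 = 0.5, 0.2, -0.1
x1, x2 = 3.0, 2.0
eta = alpha + beta1 * x1 + beta2 * x2  # roughly 0.9

mu_count = inverse_log(eta)   # mean under the log link, about 2.46
mu_prob = inverse_logit(eta)  # mean under the logit link, about 0.71

print(mu_count, mu_prob)
```

The same linear predictor gives a positive mean under the log link and a mean between 0 and 1 under the logit link, which is why these links pair naturally with counts and with yes/no responses.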
Which is which?
To wrap things up, here is a quick summary of what model you should use depending on the nature of the three components we have discussed:
Random Component | Predictors | Link Function | Model to use |
Normal | Continuous | Identity | Linear Regression |
Normal | Categorical | Identity | ANOVA |
Normal | Mixed | Identity | Analysis of Covariance (ANCOVA) |
Binomial | Mixed | Logit | Logistic |
Poisson | Mixed | Log | Log-linear |
Of course, these are only a few of the many models that belong to the class of GLMs. They are shown here because they are among the most commonly used in practice.
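If it helps, the summary table can also be written down as a small lookup. This is only a sketch that mirrors the rows above; the names MODEL_TABLE and suggest_model are hypothetical, and any combination not listed in the table simply returns None.

```python
# The summary table encoded as a dictionary keyed by
# (random component, predictor type, link function).
MODEL_TABLE = {
    ("normal", "continuous", "identity"): "Linear Regression",
    ("normal", "categorical", "identity"): "ANOVA",
    ("normal", "mixed", "identity"): "Analysis of Covariance (ANCOVA)",
    ("binomial", "mixed", "logit"): "Logistic",
    ("poisson", "mixed", "log"): "Log-linear",
}

def suggest_model(random_component, predictors, link):
    """Look up the model for one row of the table, ignoring letter case."""
    key = (random_component.lower(), predictors.lower(), link.lower())
    return MODEL_TABLE.get(key)  # None if the combination is not in the table

print(suggest_model("Binomial", "Mixed", "Logit"))
print(suggest_model("Normal", "Categorical", "Identity"))
```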
Welcome to the world of GLMs. This primer is just the beginning — there is a long way ahead towards mastery. You are off to a good start.