Linear regression in Python for Epidemiologists in 6 steps
In this tutorial we will cover the following steps:
1. Open the dataset
2. Explore data
3. Make a research question (that can be answered using a linear regression model)
4. Running and reading a simple linear regression
5. Running and reading a multiple linear regression
6. Answering our research question
Let’s get started!
1. Open the dataset
We are going to use a publicly available version of the Framingham Heart Study, downloaded from: https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset/version/1
On this website, you can also consult the coding of the different variables. For example, regarding gender (variable male): male=1 means male and male=0 means female.
First, download the file to your computer.
Here is the Jupyter Notebook version (dataset downloaded from Kaggle).
Here is the Kaggle Notebook, so you can run everything in your browser.
First import all relevant packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
If this is your first time using StatsModels, you may need to install it first (for example, with conda if you use the Anaconda Python distribution).
conda install -c conda-forge statsmodels
Then open the Framingham database (or data frame). We will name it df to make things easier.
df = pd.read_csv('framingham.csv')
From now every time we want to “call” our data frame, we will use df in our code.
2. Explore data
Let’s take a look at the first 5 rows and all our columns.
df.head()
Let’s see how many columns and rows we have…
df.shape
(4240, 16)
The output means the data frame has 4240 observations (rows) and 16 variables (columns). Remember .head() only gave us a glimpse from the 4240 observations/rows.
Now let’s describe our variables
df.describe()
The empty parentheses in df.describe() tell Python to call the method with its default settings, which summarizes all numeric columns.
Here we can see the number of observations per column (count), followed by the mean and standard deviation (sd). Note that for binary discrete variables the interpretation is somewhat different and the mean will equal the relative proportion (0 to 1, instead of 0–100%).
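To see the mean-equals-proportion point concretely, here is a minimal sketch with made-up data (the toy data frame and its values are assumptions for illustration, not the Framingham data):

```python
import pandas as pd

# Hypothetical toy column: 10 patients, 3 of them coded male=1
toy = pd.DataFrame({"male": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# For a 0/1 binary variable, the mean equals the proportion of 1s (here 3/10)
print(toy["male"].mean())  # 0.3, i.e. 30% of patients are male
```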
3. Make a research question
Is there an association between systolic blood pressure and age, after adjusting for relevant confounders?
How can we answer this?
Here we have a continuous outcome (systolic blood pressure in mmHg) and an independent continuous covariable (age in years).
Since systolic blood pressure (sBP) (variable named sysBP) has a normal distribution and we want to test an association, we should go for linear regression.
We will consider the following available variables as potential relevant confounders for the association between sBP and age; gender (binary variable: male), education (categorical variable: education), smoking (continuous variable: cigsPerDay for cigarettes per day), intake of medication for hypertension (binary variable: BPMeds), and total cholesterol (continuous variable: totChol).
Before running the numbers we have to define a null hypothesis and an alternative hypothesis:
- Null hypothesis (H0): There is NO association between sBP and age (considering the chosen confounders);
- Alternative hypothesis (H1): There is an association between sBP and age (considering the chosen confounders).
4. Running and reading a simple linear regression
Let’s go for a simple linear regression.
Let’s describe the model. A common alternative in data science would be the Scikit-learn package. We are using the StatsModels package instead, which is more user-friendly for those from an Epidemiology background already familiar with other languages and programs, such as Stata and R/RStudio.
model1 = smf.ols(formula='sysBP ~ age', data=df).fit()
- smf calls statsmodels.formula.api (the formula interface of the StatsModels package)
- ols tells Python we are using an Ordinary Least Squares (OLS) regression (a type of linear regression)
- formula= is used to write the dependent and all the independent variable(s)
- the first variable inside the parentheses, before the “~”, is our only dependent variable. This is our outcome, the variable that determines which type of regression to use, and the one to be associated with all other covariates;
- ~ marks the border between the outcome (dependent variable) to the left, and the covariates (independent variables) to the right;
- all other variables after the “~” are the covariates (independent variables);
- the + sign is used to separate different independent variables inside the same model (useful for multivariable models, i.e., models with many independent variables);
- data= gives the name of the data frame;
- .fit() tells Python we want to fit our model (“run the regression”).
model1 = smf.ols(formula='sysBP ~ age', data=df).fit()
print(model1.summary())
On the upper half of the output we can see some important general data about the model.
On the left column, we see
- The dependent variable is sysBP
- We are running an ordinary least square (OLS) regression model (a form of linear regression)
- We included 4240 observations (patients).
On the right column, we can see some important information regarding the model diagnosis:
- R-squared = 0.155, meaning about 16% of the outcome variability is explained by the model. The R-squared ranges from 0 to 1 and measures how close the data are to the fitted regression line. In other words, it indicates the percentage of the variability of the outcome (response data) that is explained by the model;
- The F-statistic is used to calculate the overall p-value of the model, Prob (F-statistic), which here is <0.05. This is an F-test of the whole model (an ANOVA-type test, which relies on the F-distribution).
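These model-level quantities can be read directly from a fitted model as attributes. A minimal sketch on simulated data (the toy data frame, slope, and noise level are assumptions, not the Framingham data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 200 patients with a known positive age effect plus noise
rng = np.random.default_rng(0)
sim = pd.DataFrame({"age": rng.uniform(30, 70, 200)})
sim["sysBP"] = 80 + 1.0 * sim["age"] + rng.normal(0, 10, 200)

m = smf.ols("sysBP ~ age", data=sim).fit()

print(m.rsquared)         # share of outcome variability explained by the model
print(m.fvalue)           # F-statistic for the whole model
print(m.f_pvalue < 0.05)  # Prob (F-statistic), the overall-model p-value
```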
Since we are performing an epidemiological study to observe/test an association, instead of building a predictive model, we will not split our data into training and testing. So, do not be surprised if this approach seems different from what you will read in most data science tutorials.
Regarding the output on the bottom half
We are going to read the output, column by column. Here we are assessing the association between our outcome (sysBP) and our independent variable, age (first column). Age has a positive coefficient of 1.0128 (second column), with a standard error of 0.036 (third column) and a t-ratio (for a t-distribution) of 27.911 (fourth column), which gives a p-value <0.001 (fifth column), corresponding to a 95% confidence interval for the beta coefficient between 0.942 (lower bound, sixth column) and 1.084 (upper bound, seventh column).
To make things simpler…
We can reject the null hypothesis of no association between age and sBP. There is a positive (beta coefficient>0) association between age and systolic blood pressure, which is statistically significant (p<0.001 and the 95% CI does not include a coefficient of 0.0).
Remember the beta coefficient can take a value between -∞ and +∞.
- A positive coefficient represents a positive linear association;
- A negative coefficient represents a negative linear association.
Regarding the 95% Confidence Interval (95% CI) for the beta coefficient
- If the 95% CI includes 0, the association is not statistically significant (p≥0.05), the null hypothesis of no association cannot be rejected;
- If the 95% CI does not include 0, the association is statistically significant (p<0.05); the null hypothesis of no association can be rejected.
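Both quantities can be pulled straight from a fitted model, which makes the p-value/CI agreement easy to check. A minimal sketch on simulated data (the toy data frame and values are assumptions, not the Framingham data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a real underlying age effect
rng = np.random.default_rng(0)
sim = pd.DataFrame({"age": rng.uniform(30, 70, 200)})
sim["sysBP"] = 80 + 1.0 * sim["age"] + rng.normal(0, 10, 200)

m = smf.ols("sysBP ~ age", data=sim).fit()

ci = m.conf_int()           # DataFrame: column 0 = lower bound, column 1 = upper bound
lower, upper = ci.loc["age"]
p = m.pvalues["age"]

# The 95% CI excludes 0 exactly when the two-sided p-value is below 0.05
print((p < 0.05) == (not (lower <= 0 <= upper)))
```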
What about the “(Intercept)”?
In this case, (Intercept) = 82.14 represents the intercept, or β0. This means that when age=0, the predicted sBP is 82 mmHg. In this context, the β0 is irrelevant for our research question, because we are NOT building a predictive model. We are assessing an association.
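For the curious, the intercept and slope live in the fitted model’s params, and the fitted line is simply β0 + β1·age. A minimal sketch on simulated data (the toy data frame and values are assumptions, not the Framingham data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
sim = pd.DataFrame({"age": rng.uniform(30, 70, 200)})
sim["sysBP"] = 80 + 1.0 * sim["age"] + rng.normal(0, 10, 200)

m = smf.ols("sysBP ~ age", data=sim).fit()

b0 = m.params["Intercept"]  # beta0: predicted sBP when age = 0
b1 = m.params["age"]        # slope: change in mean sBP per 1-year increase in age

# The model's prediction at age 50 is just b0 + b1 * 50
pred_at_50 = m.predict(pd.DataFrame({"age": [50]}))[0]
print(abs(pred_at_50 - (b0 + b1 * 50)) < 1e-8)
```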
5. Running and reading a multiple linear regression
The process here is quite simple: just add all covariates (independent variables) after the first independent variable (in this case, after age), separated by + signs.
model2 = smf.ols(formula='sysBP ~ age + male + education + cigsPerDay + BPMeds + totChol', data=df).fit()
print(model2.summary())
Before reading the output, we see this is a valid model but… it is not quite what we wanted. Discrete variables with multiple categories (e.g., education) are not stratified by groups. We can specify the type of covariates.
model3 = smf.ols(formula='sysBP ~ age + male + C(education) + cigsPerDay + BPMeds + totChol', data=df).fit()
print(model3.summary())
Great, now we can move to the interpretation of our output.
Let’s take a look at the continuous covariates
First of all, notice that the coefficient and 95% CI for age are slightly different from those in the previous simple linear regression. This is a normal and expected phenomenon. As observed before, there is a statistically significant positive linear association.
Then notice that we only have 4007 observations. This is because 233 patients (4240–4007) had at least one missing observation for any of all the variables in the model.
Remember classical OLS can only run with complete data, meaning that every time one observation (row) has missing data for at least one covariate, this row will not enter the model.
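You can count these incomplete rows yourself before fitting. A minimal sketch with a made-up data frame (the toy values are assumptions for illustration, not the Framingham data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame: rows 1 and 2 each miss one value
toy = pd.DataFrame({
    "sysBP": [120, 135, np.nan, 150],
    "age":   [40, np.nan, 55, 60],
})

n_incomplete = int(toy.isnull().any(axis=1).sum())  # rows with at least 1 missing value
complete = toy.dropna()                             # what a complete-case OLS fit uses

print(n_incomplete)   # 2
print(len(complete))  # 2
```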
How do we read the output? Taking age as an example:
- The null hypothesis (H0) is: In the population, there is no linear association between age and sBP, when controlling for gender, education, smoking, anti-hypertensors intake, and total cholesterol (in other words, when controlling for all other covariates present in the model);
- Since p-value<0.05, we reject H0, meaning there is a linear association between age and sBP, when controlling for gender, education, smoking, anti-hypertensors intake, and total cholesterol;
- Interpretation of the slope: for each 1-unit (1-year) increase in age, mean sBP increases by 0.84 mmHg.
Notice an important pattern regarding p-values and 95%CI
- When p<0.05, the confidence interval does NOT include 0.0 (e.g., age, beta coefficient (95% CI): 0.84 [0.76;0.91]);
- When p≥0.05, the confidence interval includes 0.0 (e.g., male, beta coefficient (95% CI): -0.087 [-1.40;+1.22]).
Let’s take a look at the discrete binary covariates
Let’s look at gender (male=0 means female patient; male = 1 means male patient):
- The null hypothesis (H0) is: In the population, there is no difference between males and females in their average value for sBP, when controlling for age, education, smoking, anti-hypertensors intake, and total cholesterol;
- Here beta coefficient (95% CI) = -0.09 (-1.4;+1.23). Since the 95%CI includes 0.0 (and p≥0.05) we fail to reject H0.
Let’s take a look at the discrete non-binary covariates
Now we reach the most puzzling part of the model… the discrete non-binary covariate: education.
But what’s missing? Category one (education=1).
Why? Because all other categories are compared relative to the reference category (education=1).
df.education.value_counts()
As we see, there are 4 categories (1, 2, 3, 4). In the original dataset:
- 1 = Some High School
- 2 = High School or GED
- 3 = Some College or Vocational School
- 4 = College
By default, the model uses the category with the lowest index (in this case 1, some high school) as the reference. All remaining categories in the model are compared to the category of reference.
But, how do I read the coefficients for education?
In the case of education, looking first at the first row, when education=2
- The null hypothesis (H0) is: In the population, there is no difference between those with “some high school” (education=1) and those with “high school” (education=2) in their average value for sBP, when controlling for age, gender, smoking, anti-hypertensors intake, and total cholesterol;
- Since p-value≥0.05, we fail to reject the null hypothesis for this category.
What about different education categories?
We see that for the 3rd category of education (“Some College”), the beta coefficient (95% CI) is -3.57 (-5.35;-1.79), meaning that: “Compared to patients with Some High School (education=1), patients with Some College (education=3) have a lower mean sBP, and this difference is statistically significant”.
Note that in this model we cannot compare the association between patients in the category Education=2 and patients in the category Education=3. For this, we would have to change the reference.
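Changing the reference can be done inside the formula with patsy’s Treatment coding. A minimal sketch on simulated data (the toy data frame and values are assumptions, not the Framingham data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated patients with an education category from 1 to 4
rng = np.random.default_rng(1)
n = 400
sim = pd.DataFrame({
    "age": rng.uniform(30, 70, n),
    "education": rng.integers(1, 5, n),  # categories 1, 2, 3, 4
})
sim["sysBP"] = 80 + 1.0 * sim["age"] + rng.normal(0, 10, n)

# Default coding: the lowest category (education=1) is the reference
m_default = smf.ols("sysBP ~ age + C(education)", data=sim).fit()

# Treatment(reference=2) makes education=2 the reference instead,
# so the education=3 coefficient now compares category 3 to category 2
m_ref2 = smf.ols("sysBP ~ age + C(education, Treatment(reference=2))", data=sim).fit()

print(m_default.params.index.tolist())
print(m_ref2.params.index.tolist())
```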
6. Answering our research question
Is there an association between systolic blood pressure and age?
Yes, there is a positive association between age and systolic blood pressure (beta coefficient [95% CI]: 0.84 [0.76;0.91]), which is statistically significant (p<0.05 and the confidence interval does not include 0), even when considering the effect of possible relevant confounders (covariates in the model).