Cam

Reputation: 121

How to run OLS regression on a pandas dataframe with multiple independent variables?

I have this dataframe:

import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A1": [10, 20, 30, 40, 50], "A2": [40, 30, 50, 60, 70], "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523], "D": [55, 462, 564, 13, 56]})
   
    A1  A2  B   C   D
0   10  40  20  32  55
1   20  30  30  234 462
2   30  50  10  45  564
3   40  60  40  33  13
4   50  70  50  425 56

I want to run a linear regression with multiple independent variables (A1 and A2) on this dataframe, but I'm not sure how to express that in the formula:

result = sm.ols(formula = "A1,A2 ~ B + C + D", data = df).fit()

This doesn't work because I can only give one independent variable. Do I have to make multiple dataframes?

Upvotes: 2

Views: 4692

Answers (1)

Mustafa Aydın

Reputation: 18306

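A note on terminology first: A1 and A2 sit on the left of `~`, so they are the dependent (response) variables, and B, C and D are the independent ones. An OLS regression with 2 dependent variables gives the same coefficient estimates as 2 separate OLS models with one dependent variable each, as long as they share the same predictors. This generalizes to N.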
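As a quick sanity check of that equivalence, here is a minimal sketch using plain NumPy least squares rather than statsmodels. It reuses the numbers from the question's `pd.DataFrame` call; the names `X`, `Y`, `joint_coefs` and `separate_coefs` are made up for illustration, and an intercept column is added by hand because the formula interface adds one by default.

```python
import numpy as np

# Predictors B, C, D from the question's DataFrame, plus an explicit intercept column.
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)
D = np.array([55, 462, 564, 13, 56], dtype=float)
X = np.column_stack([np.ones_like(B), B, C, D])

# The two response columns, A1 and A2, side by side.
Y = np.array([[10, 40], [20, 30], [30, 50], [40, 60], [50, 70]], dtype=float)

# Joint fit: a single least-squares solve with both responses at once.
joint_coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Separate fits: one least-squares solve per response column.
separate_coefs = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

# Compare the two coefficient matrices (rows: intercept, B, C, D; columns: A1, A2).
same = np.allclose(joint_coefs, separate_coefs)
print(same)
```

Each column of `joint_coefs` holds the intercept and slopes for one response, which is why fitting the responses one at a time loses nothing as far as the coefficients are concerned.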

So, you can do this:

result_1 = sm.ols(formula="A1 ~ B + C + D", data=df).fit()
result_2 = sm.ols(formula="A2 ~ B + C + D", data=df).fit()

If you have more than 2 such columns and their names all start with "A", for example, this generalizes to:

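# Columns on the left of "~" (A1, A2, ...) are the dependent (response) variables
dep_vars = df.filter(regex="^A").columns
# Every remaining column (B, C, D) is used as a predictor
predictors = df.columns.difference(dep_vars)

# One fitted model per response column, in the same order as dep_vars
results = [sm.ols(formula=f"{dep} ~ {' + '.join(predictors)}", data=df).fit()
           for dep in dep_vars]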

You can then index into `results`: `results[0]` is the fitted model for the first "A" column, `results[1]` for the second, and so on.
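If positional indexing gets hard to follow, one possible variation is to key the fitted models by column name and collect the coefficients into a single table. This is only a sketch: `responses`, `predictors`, `results_by_name` and `coef_table` are illustrative names, and the DataFrame is rebuilt from the question's code so the block runs on its own.

```python
import pandas as pd
import statsmodels.formula.api as sm

# Same data as in the question's pd.DataFrame call.
df = pd.DataFrame({
    "A1": [10, 20, 30, 40, 50],
    "A2": [40, 30, 50, 60, 70],
    "B": [20, 30, 10, 40, 50],
    "C": [32, 234, 23, 23, 42523],
    "D": [55, 462, 564, 13, 56],
})

# Response columns start with "A"; every other column is a predictor.
responses = df.filter(regex="^A").columns
predictors = df.columns.difference(responses)
rhs = " + ".join(predictors)

# One fitted model per response, looked up by column name instead of position.
results_by_name = {
    name: sm.ols(formula=f"{name} ~ {rhs}", data=df).fit()
    for name in responses
}

# One column of coefficients per response, one row per model term.
coef_table = pd.DataFrame(
    {name: res.params for name, res in results_by_name.items()}
)
print(coef_table)
```

With this layout, `results_by_name["A2"].summary()` gives the full report for a single response, and `coef_table` lets you compare the slopes across responses at a glance.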

Upvotes: 1
