Reputation: 121
I have this dataframe:
import pandas as pd
import statsmodels.formula.api as sm
df = pd.DataFrame({"A1": [10, 20, 30, 40, 50], "A2": [40, 30, 50, 60, 70], "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523], "D": [55, 462, 564, 13, 56]})
   A1  A2   B    C    D
0  10  40  20   32   55
1  20  30  30  234  462
2  30  50  10   45  564
3  40  60  40   33   13
4  50  70  50  425   56
I want to fit a multiple linear regression with more than one dependent variable (A1 and A2) on this dataframe, but I'm confused about how to express that in the formula:
result = sm.ols(formula = "A1,A2 ~ B + C + D", data = df).fit()
This doesn't work because I can only give one dependent variable on the left-hand side. Do I have to make multiple dataframes?
Upvotes: 2
Views: 4692
Reputation: 18306
Regression with 2 dependent variables is equivalent to 2 separate linear regression models with one dependent variable each: for ordinary least squares, the coefficient estimates are the same either way. This generalizes to N.
So, you can do this:
result_1 = sm.ols(formula="A1 ~ B + C + D", data=df).fit()
result_2 = sm.ols(formula="A2 ~ B + C + D", data=df).fit()
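As a quick sanity check of that equivalence, here is a minimal sketch that uses NumPy's least-squares solver instead of statsmodels (that swap is my choice for illustration, not something the formula API does for you). It fits both A columns in one call and then one at a time, using the values from the question's DataFrame constructor; the variable names are made up for this example.

```python
import numpy as np

# Values copied from the question's DataFrame constructor.
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)
D = np.array([55, 462, 564, 13, 56], dtype=float)
# Response matrix: first column is A1, second column is A2.
Y = np.array([[10, 40], [20, 30], [30, 50], [40, 60], [50, 70]], dtype=float)

# Design matrix with an intercept column, mirroring "~ B + C + D".
X = np.column_stack([np.ones_like(B), B, C, D])

# One joint least-squares fit with both responses at once ...
joint_coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ... versus one least-squares fit per response column.
separate_coefs = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

# Compare the two coefficient matrices (rows: intercept, B, C, D).
same = np.allclose(joint_coefs, separate_coefs)
print(same)
```

Each column of the coefficient matrix corresponds to one response, so nothing is lost by fitting the models separately with the formula interface.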
If you had more than 2 and they all start with A, for example, we can generalize this to
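# columns starting with "A" are the dependent (left-hand side) variables
dep_vars = df.filter(regex="^A").columns
# every other column is used as a predictor on the right-hand side
predictors = df.columns.difference(dep_vars)
# fit one OLS model per dependent variable
results = [sm.ols(formula=f"{dep} ~ {' + '.join(predictors)}", data=df).fit()
           for dep in dep_vars]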
and then you can index into results (for example, results[0] is the fit for A1).
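If positional indexing gets hard to follow, one alternative is to key the fits by column name. This is only a sketch under the same assumption as above (the left-hand side columns all start with "A"); the names response_cols, predictor_cols and results_by_response are hypothetical, and the module is imported as smf here rather than sm.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "A1": [10, 20, 30, 40, 50],
    "A2": [40, 30, 50, 60, 70],
    "B": [20, 30, 10, 40, 50],
    "C": [32, 234, 23, 23, 42523],
    "D": [55, 462, 564, 13, 56],
})

# Assumption: every column whose name starts with "A" goes on the left-hand side.
response_cols = [c for c in df.columns if c.startswith("A")]
predictor_cols = [c for c in df.columns if c not in response_cols]
rhs = " + ".join(predictor_cols)

# One fitted OLS model per response column, keyed by the column name.
results_by_response = {
    resp: smf.ols(formula=f"{resp} ~ {rhs}", data=df).fit()
    for resp in response_cols
}

# Look up a fit by name instead of by position.
a1_params = results_by_response["A1"].params
print(a1_params)
```

With only five rows and four estimated parameters there is a single residual degree of freedom, so treat the numbers from this toy data as a demonstration of the mechanics only.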
Upvotes: 1