Suppose I have a DataFrame with one column of y variable and many columns of x variables. I would like to be able to run multiple univariate regressions of y vs x1, y vs x2, ..., etc, and store the predictions back into the DataFrame. Also I need to do this by a group variable.
import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res) # This does not work
The code above obviously does not work. It is not clear to me how to correctly pass the fixed y to the function while having apply iterate through the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution to do this. Any ideas?
Upvotes: 9
Views: 9505
Reputation: 3936
The function you pass to apply must take a pandas.DataFrame as its first argument. You can pass additional keyword or positional arguments to apply, and they get passed on to the applied function. So your example would work with a small modification. Change ols_res to
def ols_res(df, xcols, ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()
Then you can use groupby and apply like this
df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')
Or
df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
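To see the argument forwarding in isolation, here is a minimal sketch with no regression involved. The function scaled_range and the toy data are hypothetical, made up only for illustration; the point is just that everything after the function in apply is handed to it, either by position or by keyword.

```python
import pandas as pd

# Toy data, invented for this illustration
df = pd.DataFrame({
    'x1': [1.0, 2.0, 4.0, 8.0],
    'grp': ['a', 'b', 'a', 'b'],
})

def scaled_range(group, col, scale=1.0):
    # `group` is the sub-DataFrame holding the rows of one 'grp' value;
    # `col` and `scale` are the extra arguments forwarded by apply
    return (group[col].max() - group[col].min()) * scale

# Extra arguments forwarded by position ...
by_position = df.groupby('grp').apply(scaled_range, 'x1', 2.0)
# ... or by keyword; both calls give the same result
by_keyword = df.groupby('grp').apply(scaled_range, col='x1', scale=2.0)

print(by_position)
```

Both calls return a Series indexed by the group label, with one value per group. Depending on the pandas version, apply may emit a deprecation warning about the grouping column being passed to the function; the function here only reads x1, so the result is unaffected.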
EDIT
The above code does not run multiple univariate regressions. Instead, it runs one multiple regression per group (y on all of the x columns at once). With another slight modification, however, it will run them one column at a time.
def ols_res(df, xcols, ycol):
    return pd.DataFrame({xcol : sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
EDIT 2
Although the above solution works, I think the following is a little more pandas-y
import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda x : x[['x1', 'x2']].apply(ols_res, y=x['y']))
For some reason, if I define ols_res() as it was originally (returning the result of predict() directly, without wrapping it in pd.Series), the resultant DataFrame doesn't have the group label in the index.
Upvotes: 7