user1642513

Python pandas: how to run multiple univariate regressions by group

Suppose I have a DataFrame with one y column and many x columns. I would like to run multiple univariate regressions (y vs x1, y vs x2, and so on) and store the predictions back into the DataFrame. I also need to do this separately for each value of a group variable.

import statsmodels.api as sm
import pandas as pd
import numpy as np  # needed for np.random.randn below

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res) # This does not work

The code above obviously does not work. It is not clear to me how to pass the fixed y to the function while having apply iterate through the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution. Any ideas?

Upvotes: 9

Views: 9505

Answers (1)

JaminSore

Reputation: 3936

The function you pass to apply must take a pandas.DataFrame as its first argument. Any additional positional or keyword arguments you give to apply are passed through to that function. So your example works with a small modification. Change ols_res to:

def ols_res(df, xcols, ycol):
    # df is the sub-DataFrame for one group; regress ycol on all xcols at once
    return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply like this:

df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

Or, passing the arguments positionally:

df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
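To make the argument forwarding easier to see in isolation, here is a minimal sketch with a toy function instead of a regression. The helper scaled_mean and the small DataFrame are made up for illustration; only groupby and apply come from the answer above. The y column is selected explicitly so that each call receives just that column for one group.

```python
import pandas as pd

df = pd.DataFrame({
    'y': [1.0, 2.0, 3.0, 4.0],
    'grp': ['a', 'b', 'a', 'b'],
})

def scaled_mean(group, col, scale=1.0):
    # Hypothetical helper: `group` is the sub-DataFrame for one group;
    # `col` and `scale` are forwarded unchanged by apply.
    return group[col].mean() * scale

# Extra arguments can be given by keyword ...
by_keyword = df.groupby('grp')[['y']].apply(scaled_mean, col='y', scale=2.0)
# ... or by position, in the order the function declares them.
by_position = df.groupby('grp')[['y']].apply(scaled_mean, 'y', 2.0)
```

Both calls return a Series indexed by the group label, with one value per group.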

EDIT

The above code does not run multiple univariate regressions. Instead, it runs one multivariate regression per group. With another slight modification, however, it will:

def ols_res(df, xcols, ycol):
    # One regression of ycol on each x column; one column of fitted values per regressor
    return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
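The question also asks for the predictions to be stored back into the DataFrame. One possible way to do that is sketched below, under a few assumptions: the helper fit_through_origin and the `_pred` column names are hypothetical, and the closed-form slope sum(x*y) / sum(x*x) is swapped in for sm.OLS so the sketch needs only numpy and pandas. For a single regressor without a constant, this should be the same fit that sm.OLS(y, x) produces, since sm.OLS does not add an intercept on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'x2': rng.standard_normal(20),
    'grp': ['a', 'b'] * 10,
})

def fit_through_origin(x, y):
    # Hypothetical helper: least-squares slope for y = beta * x (no intercept),
    # returned as fitted values that keep the original row index of x.
    beta = (x * y).sum() / (x * x).sum()
    return beta * x

xcols = ['x1', 'x2']
for xcol in xcols:
    # Fit each group separately, then stitch the pieces together.
    parts = [fit_through_origin(g[xcol], g['y']) for _, g in df.groupby('grp')]
    # Column assignment aligns on the index, so every prediction
    # lands on the row it was computed from.
    df[f'{xcol}_pred'] = pd.concat(parts)
```

Because the fitted values keep the original row index, no reshaping of a group-indexed result is needed before writing them back.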

EDIT 2

Although the above solution works, I think the following is a little more pandas-y:

import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda g: g[['x1', 'x2']].apply(ols_res, y=g['y']))

For some reason, if I define ols_res() as it was originally, the resultant DataFrame doesn't have the group label in the index.

Upvotes: 7
