invoker
invoker

Reputation: 517

Performing math on a Python Pandas Group By DataFrame

I have a Pandas DataFrame with the following structure:

In [1]: df
Out[1]: 
    location_code    month    amount
0    1               1        10
1    1               2        11
2    1               3        12
3    1               4        13
4    1               5        14
5    1               6        15
6    2               1        23
7    2               2        25
8    2               3        27
9    2               4        29
10   2               5        31
11   2               6        33

I also have a DataFrame with the following:

In [2]: output_df
Out[2]: 
    location_code    regression_coef
0   1                None
1   2                None

What I would like:

output_df = df.groupby('location_code')[amount].apply(linear_regression_and_return_coefficient)

I would like to group by the location code and then perform a linear regression on the values of amount and store the coefficient. I have tried the following code:

import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')['amount']

x = []
for j in range(6): x.append(j+1)

for location_code, amount in gb:
    trans = amount.tolist()
    x = sm.add_constant(x)
    model = sm.OLS(trans, x)
    results = model.fit()
    output_df['regression_coef'][merchant_location_code] = results.params[1]/np.mean(trans)

This code works, but my data set is somewhat large (about 5 gb) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.

SOLUTION

After some tinkering around, I wrote a function that can be used with the apply method on a groupby.

def get_lin_reg_coef(series):
    x=sm.add_constant(range(1,7))
    result = sm.OLS(series, x).fit().params[1]
    return result/series.mean()

gb = df.groupby('location_code')['amount']

output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)

Benchmarking this versus the iterative solution I had before, with varying input sizes gets:

DataFrame Rows    Iterative Solution (sec)    Vectorized Solution (sec)
       370,000    81.42                       84.46
     1,850,000    448.36                      365.66
     3,700,000    1282.83                     715.89
     7,400,000    5034.62                     1407.88         

Clearly a lot faster as the dataset grows in size!

Upvotes: 2

Views: 2212

Answers (1)

gabe
gabe

Reputation: 2511

Without knowing more about the data, number of records, etc, this code should run faster:

import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')['amount']

x = sm.add_constant(range(1,7))

def fit(stuff):
    return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()

output = gb.apply(fit)

Upvotes: 1

Related Questions