Reputation: 517
I have a Pandas DataFrame with the following structure:
In [1]: df
Out[1]:
location_code month amount
0 1 1 10
1 1 2 11
2 1 3 12
3 1 4 13
4 1 5 14
5 1 6 15
6 2 1 23
7 2 2 25
8 2 3 27
9 2 4 29
10 2 5 31
11 2 6 33
I also have a DataFrame with the following:
In [2]: output_df
Out[2]:
location_code regression_coef
0 1 None
1 2 None
What I would like:
output_df = df.groupby('location_code')[amount].apply(linear_regression_and_return_coefficient)
I would like to group by the location code
and then perform a linear regression on the values of amount
and store the coefficient. I have tried the following code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = []
for j in range(6): x.append(j+1)
for location_code, amount in gb:
trans = amount.tolist()
x = sm.add_constant(x)
model = sm.OLS(trans, x)
results = model.fit()
output_df['regression_coef'][merchant_location_code] = results.params[1]/np.mean(trans)
This code works, but my data set is somewhat large (about 5 gb) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.
SOLUTION
After some tinkering around, I wrote a function that can be used with the apply
method on a groupby
.
def get_lin_reg_coef(series):
x=sm.add_constant(range(1,7))
result = sm.OLS(series, x).fit().params[1]
return result/series.mean()
gb = df.groupby('location_code')['amount']
output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
Benchmarking this versus the iterative solution I had before, with varying input sizes gets:
DataFrame Rows Iterative Solution (sec) Vectorized Solution (sec)
370,000 81.42 84.46
1,850,000 448.36 365.66
3,700,000 1282.83 715.89
7,400,000 5034.62 1407.88
Clearly a lot faster as the dataset grows in size!
Upvotes: 2
Views: 2212
Reputation: 2511
Without knowing more about the data, number of records, etc, this code should run faster:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = sm.add_constant(range(1,7))
def fit(stuff):
return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()
output = gb.apply(fit)
Upvotes: 1