Reputation: 1097
I have a dataframe Table and I want to get the linear regression coefficients for each group (city, category) using Price and Sold data.
table = (pd.DataFrame({'city' : ['London', 'London', 'London', 'London', 'Paris', 'Paris'], 'category' : ['boots', 'boots', 'hats', 'hats', 'boots', 'boots'], 'Price':[100, 200, 100, 361, 261, 90], 'Sold':[1,7, 5, 2, 11, 7]}))
So I do:
def linreg (X, Y):
linreg = LinearRegression().fit(X,Y)
const = linreg.intercept_
slope = linreg.coef_
return const, slope
X = pd.DataFrame(table[['Price']])
Y = pd.DataFrame(table[['Sold']])
print (table.groupby(['city', 'category']).apply( lambda x: linreg(X, Y)))
But as I result I get the same numbers for each group, which is wrong ('cause when I calculate it separately, the results are different):
city category
London boots ([5.171956504957155], [[0.0017700188581448472]])
hats ([5.171956504957155], [[0.0017700188581448472]])
Paris boots ([5.171956504957155], [[0.0017700188581448472]])
What am I doing wrong? As a result I want to have a dataframe like this:
City category Slope const
London boots 1.96 0.52
London hats 0.06 -5
Paris boots 0.02. 4.8
Upvotes: 0
Views: 282
Reputation: 35676
The global variables X
and Y
:
X = pd.DataFrame(table[['Price']])
Y = pd.DataFrame(table[['Sold']])
Are being used here:
.apply( lambda x: linreg(X, Y)))
For this reason, the same frames are being passed in and calculated, and the local variable x
(the current group) is not being used at all. Hence, every group having the same results.
Instead, utilize the variable that comes from the groupby
and pass those 'Price'
and 'Sold'
frames into the linreg
function.
import pandas as pd
from sklearn.linear_model import LinearRegression
table = (pd.DataFrame({'city': ['London', 'London', 'London',
'London', 'Paris', 'Paris'],
'category': ['boots', 'boots', 'hats',
'hats', 'boots', 'boots'],
'Price': [100, 200, 100, 361, 261, 90],
'Sold': [1, 7, 5, 2, 11, 7]}))
def lin_reg(x, y):
l_reg = LinearRegression().fit(x, y)
slope = l_reg.coef_[0][0] # Get Value Instead of Nested List
const = l_reg.intercept_[0] # Get Value Instead of List
return slope, const # Return in Order they should appear in result table
print(
table.groupby(['city', 'category'])
.apply(lambda s:
# Create Series so results expand into 2 Columns
pd.Series(
lin_reg(
s['Price'].to_frame(),
s['Sold'].to_frame()
), index=['slope', 'const'])
)
.reset_index()
)
Output:
city category slope const
0 London boots 0.060000 -5.000000
1 London hats -0.011494 6.149425
2 Paris boots 0.023392 4.894737
Upvotes: 2