Polly
Polly

Reputation: 1097

applying function using group by in Pandas gives the same result for al groups

I have a dataframe Table and I want to get the linear regression coefficients for each group (city, category) using Price and Sold data.

table = (pd.DataFrame({'city' : ['London', 'London', 'London', 'London', 'Paris', 'Paris'], 'category' : ['boots', 'boots', 'hats', 'hats', 'boots', 'boots'], 'Price':[100, 200, 100, 361, 261, 90], 'Sold':[1,7,  5, 2,  11, 7]}))

dataframe

So I do:

def linreg (X, Y):
    linreg = LinearRegression().fit(X,Y)
    const = linreg.intercept_
    slope = linreg.coef_
    return const, slope


X = pd.DataFrame(table[['Price']])
Y = pd.DataFrame(table[['Sold']])

print (table.groupby(['city', 'category']).apply( lambda x: linreg(X, Y)))

But as I result I get the same numbers for each group, which is wrong ('cause when I calculate it separately, the results are different):

city    category
London  boots       ([5.171956504957155], [[0.0017700188581448472]])
        hats        ([5.171956504957155], [[0.0017700188581448472]])
Paris   boots       ([5.171956504957155], [[0.0017700188581448472]])

What am I doing wrong? As a result I want to have a dataframe like this:

City   category Slope const
London boots    1.96   0.52
London hats     0.06    -5
Paris  boots   0.02.    4.8

Upvotes: 0

Views: 282

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35676

The global variables X and Y:

X = pd.DataFrame(table[['Price']])
Y = pd.DataFrame(table[['Sold']])

Are being used here:

.apply( lambda x: linreg(X, Y)))

For this reason, the same frames are being passed in and calculated, and the local variable x (the current group) is not being used at all. Hence, every group having the same results.


Instead, utilize the variable that comes from the groupby and pass those 'Price' and 'Sold' frames into the linreg function.

import pandas as pd
from sklearn.linear_model import LinearRegression

table = (pd.DataFrame({'city': ['London', 'London', 'London',
                                'London', 'Paris', 'Paris'],
                       'category': ['boots', 'boots', 'hats',
                                    'hats', 'boots', 'boots'],
                       'Price': [100, 200, 100, 361, 261, 90],
                       'Sold': [1, 7, 5, 2, 11, 7]}))


def lin_reg(x, y):
    l_reg = LinearRegression().fit(x, y)
    slope = l_reg.coef_[0][0]  # Get Value Instead of Nested List
    const = l_reg.intercept_[0]  # Get Value Instead of List
    return slope, const  # Return in Order they should appear in result table


print(
    table.groupby(['city', 'category'])
        .apply(lambda s:
               # Create Series so results expand into 2 Columns
               pd.Series(
                   lin_reg(
                       s['Price'].to_frame(),
                       s['Sold'].to_frame()
                   ), index=['slope', 'const'])
               )
        .reset_index()
)

Output:

     city category     slope     const
0  London    boots  0.060000 -5.000000
1  London     hats -0.011494  6.149425
2   Paris    boots  0.023392  4.894737

Upvotes: 2

Related Questions