E. Zeytinci

Reputation: 2643

How can I use groupby with multiple values in a column in pandas?

I have a dataframe like the following:

import pandas as pd

data = {
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}

df = pd.DataFrame(data)
print(df)

      brand model  year   price
0  Mercedes     X  2011     NaN
1   Renault     Y  2010  1000.4
2      Ford     Z  2009  2000.3
3  Mercedes     X  2010  1000.0
4  Mercedes     X  2012  1100.3
5  Mercedes     X  2020  3000.5
6   Renault     Q  2011     NaN

And here is another case to test your solution:

data = {
    'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'], 
    'model': ['X', 'X', 'X', 'X', 'X'], 'year': [2017, 2018, 2018, 2019, 2019], 
    'price': [None, None, None, 1000.0, 1200.50]
}

Expected output,

      brand model  year    price
0  Mercedes     X  2017      NaN
1  Mercedes     X  2018  1100.25
2  Mercedes     X  2018  1100.25
3  Mercedes     X  2019  1000.00
4  Mercedes     X  2019  1200.50

I want to fill the missing values with the average of the observations from year-1, year, and year+1 that share the same brand and model. For instance, the Mercedes X model has a null price in 2011. When I look at the data,

2011 - 1 = 2010
2011 + 1 = 2012

The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3

The mean -> (1000.0 + 1100.3) / 2 = 1050.15
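This worked example can be checked directly on the sample frame (a quick sketch; the mask is just the rule above spelled out for the Mercedes X 2011 row):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

# Same brand and model, year within [2010, 2012]; mean() skips the NaN itself
target_year = 2011
mask = (
    (df['brand'] == 'Mercedes')
    & (df['model'] == 'X')
    & df['year'].between(target_year - 1, target_year + 1)
)
print(round(df.loc[mask, 'price'].mean(), 2))  # 1050.15
```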

I've tried something as follows,

for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
        
    mean_val = df[fc]['price'].mean()

    df.loc[sc, 'price'] = mean_val

print(df)

      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN

But this solution takes a long time for 90,000 rows and 27 columns, so is there a more efficient solution? For instance, can I use groupby with the values year-1, year, year+1, brand and model?

Thanks in advance.

Upvotes: 5

Views: 1344

Answers (5)

This is not a pretty solution, but from your description I believe it would work and be really fast. It's just a series of conditions inside an np.where on a sorted dataframe.

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})

data = data.sort_values(by=['brand', 'model', 'year'])

# After sorting, a row can be filled when its direct neighbours share the
# same brand and model and are exactly one year before and one year after it
prev_match = ((data['brand'] == data['brand'].shift(1))
              & (data['model'] == data['model'].shift(1))
              & (data['year'] == data['year'].shift(1) + 1))
next_match = ((data['brand'] == data['brand'].shift(-1))
              & (data['model'] == data['model'].shift(-1))
              & (data['year'] == data['year'].shift(-1) - 1))

data['price'] = np.where(data['price'].isnull() & prev_match & next_match,
                         (data['price'].shift(1) + data['price'].shift(-1)) / 2,
                         data['price'])
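As a quick check, re-running this shift-based approach end-to-end on the question's first sample fills the Mercedes X 2011 row with the neighbour average while leaving Renault Q 2011 (which has no same-model neighbours) as NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
}).sort_values(by=['brand', 'model', 'year'])

# A row is fillable when it sits directly between its year-1 and year+1
# neighbours of the same brand and model in the sorted frame
has_neighbours = (
    (df['brand'] == df['brand'].shift(1)) & (df['brand'] == df['brand'].shift(-1))
    & (df['model'] == df['model'].shift(1)) & (df['model'] == df['model'].shift(-1))
    & (df['year'] == df['year'].shift(1) + 1) & (df['year'] == df['year'].shift(-1) - 1)
)
df['price'] = np.where(df['price'].isnull() & has_neighbours,
                       (df['price'].shift(1) + df['price'].shift(-1)) / 2,
                       df['price'])
df = df.sort_index()
print(round(df.loc[0, 'price'], 2))  # Mercedes X 2011 -> 1050.15
```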

Upvotes: 1

Yolao_21

Reputation: 895

def fill_it(i):
    # Mean price over rows with the same brand (column 0) and model (column 1)
    # whose year (column 2) is one before or one after row i's year.
    # Note: iat indexes by position, so this relies on the column order.
    brand, model, year = df.iat[i, 0], df.iat[i, 1], df.iat[i, 2]
    return df[(df.brand == brand)
              & (df.model == model)
              & ((df.year == year - 1) | (df.year == year + 1))].price.mean()


df = df.apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
df



Output 1:
    brand   model   year    price
0   Mercedes    X   2011    1050.15
1   Renault     Y   2010    1000.40
2   Ford        Z   2009    2000.30
3   Mercedes    X   2010    1000.00
4   Mercedes    X   2012    1100.30
5   Mercedes    X   2020    3000.50
6   Renault     Q   2011    NaN


Output 2:
    brand   model   year    price
0   Mercedes    X   2017    NaN
1   Mercedes    X   2018    1100.25
2   Mercedes    X   2018    1100.25
3   Mercedes    X   2019    1000.00
4   Mercedes    X   2019    1200.50

This is 3x faster, since it only applies fill_it to the rows that are actually missing a price:

df.loc[df.price.isna(), 'price'] = df[df.price.isna()].apply(lambda x: x.fillna(fill_it(x.name)), axis=1)

I also tried another approach using pandas' rolling with a time-based window, and it is way faster (it runs in 200ms on the dataframe with 70k rows). The outputs are still as you wanted them.

df.year = pd.to_datetime(df.year, format='%Y')
df.sort_values('year', inplace=True)
df.groupby(['brand', 'model']).apply(lambda x: x.fillna(x.rolling('1095D',on='year', center=True).mean())).sort_index()
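The '1095D' window is three years' worth of days; with center=True it extends roughly 547 days on either side of each row's timestamp, which is how it captures year-1 and year+1. A minimal sketch of that window on the question's second sample (selecting only the price column before averaging, to sidestep the non-numeric columns):

```python
import pandas as pd

df = pd.DataFrame({
    'year': pd.to_datetime(['2017', '2018', '2018', '2019', '2019'], format='%Y'),
    'price': [None, None, None, 1000.0, 1200.5],
})

# Centred 3-year window: each row averages over all rows dated within
# about 547 days of it, i.e. the previous, current, and next year
window_mean = df.rolling('1095D', on='year', center=True)['price'].mean()
df['price'] = df['price'].fillna(window_mean)
print(df['price'].round(2).tolist())  # [nan, 1100.25, 1100.25, 1000.0, 1200.5]
```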

Upvotes: 1

aaossa

Reputation: 3872

Here goes a solution that looks simpler:

  1. Sort values in the original dataframe:

    df = df.sort_values(["brand", "model", "year"])
    
  2. Group by "brand" and "model", and store the groups in a variable (to calculate only once):

    groups = df.groupby(["brand", "model"])
    
  3. Fill nan values using the average of the previous and next rows (Important: this assumes that you have data of consecutive years, meaning that if you're missing data for 2015 you know the values of 2014 and 2016. If you have no data for consecutive years, null values will remain null).

    df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
    

Resulting code:

df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)

Output:

      brand model  year    price
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
0  Mercedes     X  2011  1050.15
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN
1   Renault     Y  2010  1000.40
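To see why step 3 works: within each (brand, model) group, ffill(limit=1) copies the previous row's price into the gap and bfill(limit=1) copies the next row's, so half their sum is exactly the neighbour average. A minimal sketch on a single group:

```python
import pandas as pd

# One (brand, model) group, indexed by year, with a gap at 2011
prices = pd.Series([1000.0, None, 1100.3], index=[2010, 2011, 2012])

# ffill fills the gap with 1000.0, bfill fills it with 1100.3;
# for rows that already have a price, the average is just the price itself
neighbour_avg = (prices.ffill(limit=1) + prices.bfill(limit=1)) / 2
filled = prices.fillna(neighbour_avg)
print(round(filled.loc[2011], 2))  # 1050.15
```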

Upvotes: 2

braml1

Reputation: 584

Based on the solution of @richardec, but with an addition to correct the price when the next year's price is known. Not sure if it is faster than your original solution, though.

import numpy as np
import pandas as pd

# Make an interpolated average
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand','year','model']).mean().reset_index()

# Check if the next line has the same brand and model. If so, take the next average price when the price isNa
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3, df1["price"].shift(-1), df1["price"])

# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model","priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(),df_out["priceCorr"], df_out["price"])

Upvotes: 2

user17242583

Reputation:

I think a more efficient way would be to sort by brand and then year, and then use interpolate:

df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

Output:

>>> df
      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN

That also handles all the columns.
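Note that limit_area='inside' restricts filling to NaNs that have valid values on both sides within the sorted data; leading and trailing NaNs are left alone. A minimal sketch of the flag's behaviour:

```python
import pandas as pd

s = pd.Series([1000.0, None, 1100.3, None])

# The middle NaN is interpolated from its neighbours;
# the trailing NaN is "outside" the valid values, so it stays NaN
out = s.interpolate(limit_area='inside')
print(out.round(2).tolist())  # [1000.0, 1050.15, 1100.3, nan]
```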

Upvotes: 2
