Reputation: 2643
I have a dataframe like the following:
import pandas as pd
data = {
'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}
df = pd.DataFrame(data)
print(df)
brand model year price
0 Mercedes X 2011 NaN
1 Renault Y 2010 1000.4
2 Ford Z 2009 2000.3
3 Mercedes X 2010 1000.0
4 Mercedes X 2012 1100.3
5 Mercedes X 2020 3000.5
6 Renault Q 2011 NaN
And here is another case to test your solution:
data = {
'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'],
'model': ['X', 'X', 'X', 'X', 'X'], 'year': [2017, 2018, 2018, 2019, 2019],
'price': [None, None, None, 1000.0, 1200.50]
}
Expected output:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
I want to fill the missing values with the average of the observations that have the same brand and model and a year of year-1, year, or year+1. For instance, the Mercedes X model has a null price in 2011. When I look at the data:
2011 - 1 = 2010
2011 + 1 = 2012
The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3
The mean -> (1000.0 + 1100.3) / 2 = 1050.15
I've tried something as follows:
for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
    mean_val = df[fc]['price'].mean()
    df.loc[sc, 'price'] = mean_val
print(df)
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
But this solution takes a long time for 90,000 rows and 27 columns, so is there a more efficient solution? For instance, can I use groupby over year-1, year, year+1, brand, and model?
Thanks in advance.
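For what it's worth, the year-window mean can be vectorized with exactly that kind of groupby: replicate each row onto the target years it can serve (year-1, year, year+1), average the prices per (brand, model, target year), and map the result back onto the NaN rows. A sketch (not benchmarked, written against the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

# Each observation in year y can serve the targets y-1, y and y+1,
# so replicate the rows onto those target years...
contrib = pd.concat(df.assign(year=df['year'] + s) for s in (-1, 0, 1))

# ...and average the prices per (brand, model, target year);
# NaN prices are simply ignored by mean().
window_mean = contrib.groupby(['brand', 'model', 'year'])['price'].mean()

# Look the windowed mean up for every row and use it to fill the gaps.
key = pd.MultiIndex.from_frame(df[['brand', 'model', 'year']])
fill = pd.Series(window_mean.reindex(key).to_numpy(), index=df.index)
df['price'] = df['price'].fillna(fill)
print(df)
```

This is a single pass of concat/groupby/reindex instead of one filter per group, so it should scale much better than the explicit loop; rows with no same-window neighbours (like Renault Q) stay NaN.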
Upvotes: 5
Views: 1344
Reputation: 151
This is not a pretty solution, but from your description I believe it would work and be really fast. It's just a lot of conditions inside an np.where on a sorted dataframe.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})
data = data.sort_values(by=['brand', 'model', 'year'])
data['adjusted_price'] = np.where(
    data['price'].isnull()
    & (data['brand'] == data['brand'].shift(1)) & (data['brand'] == data['brand'].shift(-1))
    & (data['model'] == data['model'].shift(1)) & (data['model'] == data['model'].shift(-1))
    & (data['year'] == data['year'].shift(1) + 1) & (data['year'] == data['year'].shift(-1) - 1),
    (data['price'].shift(1) + data['price'].shift(-1)) / 2,
    data['price'])
data['price'] = data['adjusted_price']
data = data.drop(['adjusted_price'], axis=1)
Upvotes: 1
Reputation: 895
def fill_it(x):
    return df[
        (df.brand == df.iat[x, 0])
        & (df.model == df.iat[x, 1])
        & ((df.year == df.iat[x, 2] - 1) | (df.year == df.iat[x, 2] + 1))
    ].price.mean()
df = df.apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
df
Output 1:
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
Output 2:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
This is 3x faster:
df.loc[df.price.isna(), 'price'] = df[df.price.isna()].apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
I tried another approach using pd.rolling, and it is way faster (on the dataframe with 70k rows it runs in 200ms). The outputs are still as you wanted them.
df.year = pd.to_datetime(df.year, format='%Y')
df.sort_values('year', inplace=True)
df.groupby(['brand', 'model']).apply(lambda x: x.fillna(x.rolling('1095D',on='year', center=True).mean())).sort_index()
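Assembled into one runnable piece (a sketch assuming pandas >= 1.3, which is needed for center=True with a time-based window; the rolling mean is restricted to the year/price columns so the non-numeric columns stay out of the aggregation):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

df['year'] = pd.to_datetime(df['year'], format='%Y')
df = df.sort_values('year')

# A centered 1095-day (~3-year) window covers year-1, year and year+1;
# the NaN row itself is skipped by mean(), so only the neighbours count.
filled = (
    df.groupby(['brand', 'model'], group_keys=False)
      .apply(lambda g: g.fillna(g[['year', 'price']]
                                .rolling('1095D', on='year', center=True)
                                .mean()))
      .sort_index()
)
filled['year'] = filled['year'].dt.year
print(filled)
```

group_keys=False keeps the original index so sort_index restores the input row order.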
Upvotes: 1
Reputation: 3872
Here goes a solution that looks simpler:
Sort values in the original dataframe:
df = df.sort_values(["brand", "model", "year"])
Group by "brand" and "model", and store the groups in a variable (to calculate only once):
groups = df.groupby(["brand", "model"])
Fill NaN values using the average of the previous and next rows. (Important: this assumes you have data for consecutive years, meaning that if you're missing the price for 2015, you do have the values for 2014 and 2016. If the neighbouring years are absent, null values will remain null.)
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
Resulting code:
df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)
Output:
brand model year price
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
0 Mercedes X 2011 1050.15
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
1 Renault Y 2010 1000.40
Upvotes: 2
Reputation: 584
Based on the solution of @richardec, but with an addition to correct the price when the next year's price is known. Not sure if it is faster than your original solution, though.
import numpy as np
# Make an interpolated average
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand','year','model']).mean().reset_index()
# Check if the next line has the same brand and model. If so, take the next average price when the price isNa
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3 ,df1["price"].shift(-1),df1["price"] )
# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model","priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(),df_out["priceCorr"], df_out["price"])
Upvotes: 2
Reputation:
I think a more efficient way would actually be to sort by brand and then year, and then use interpolate:
df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
Output:
>>> df
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 1000.40
That also handles all the columns.
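Note that grouping by brand alone lets prices interpolate across models within a brand, which is why row 6 comes out as 1000.40 here rather than NaN. A variant that keeps models separate and interpolates only the price column would be (a sketch; like the above, interpolation is positional rather than weighted by year distance):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

# Sort so that neighbouring rows within each (brand, model) group are
# adjacent years, then interpolate price between known values only.
out = (
    df.sort_values(['brand', 'model', 'year'])
      .groupby(['brand', 'model'], group_keys=False)
      .apply(lambda g: g.assign(price=g['price'].interpolate(limit_area='inside')))
      .sort_index()
)
print(out)
```

With limit_area='inside', a group with no known price on both sides (like Renault Q) keeps its NaN.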
Upvotes: 2