Reputation: 109
I am working on a carsale data set having columns : 'car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration','year', 'model', 'drive'
column 'drive' and 'engType' have NaN missing values, I want to calculate mode for let say for 'drive' based on group by of ['car', 'model'] and then where this group falls, I want to replace NaN value there based on this groupby
I have tried these methods:
for numeric data
carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
for categorical data
carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(x.mode())) carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(pd.Series.mode(x)))
both are giving same results
# carsale['price2'] = (carsale.groupby(['car','model','year']))['price'].transform(lambda x: x.fillna(x.median()))
# carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
# carsale['mileage2'] = (carsale.groupby(['car','model','year']))['mileage'].transform(lambda x: x.fillna(x.median()))
# mode = carsale.filter(['car','drive']).mode()
# carsale[['test1','test2']] = carsale[['car','engType']].fillna(carsale.mode().iloc[0])
**carsale.groupby(['car', 'model'])['engType'].apply(pd.Series.mode)**
# carsale.apply()
# carsale
# carsale['engType2'] = carsale.groupby('car').engType.transform(lambda x: x.fillna(x.mode()))
**carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
].transform(lambda x: x.fillna(pd.Series.mode(x)))**
# carsale[carsale.car == 'Mercedes-Benz'].sort_values(['body','engType','model','mileage']).tail(50)
# carsale[carsale.engV.isnull()]
# carsale.sort_values(['car','body','model'])
**carsale**
from above both methods giving the same results, it is just replacing/adding values in new column driveT same as we have in origional column 'drive'. like if we have NaN in some indexes then it is showing same NaN in driveT as well and same for other values.
But for numerical data, if i applied median it is adding/replacing correct value.
So the thing is it actually not calculating mode based on ['car', 'model'] group instead it is doing mode for single values in 'drive', but if you run this command
**carsale.groupby(['car','model'])['engType'].apply(pd.Series.mode)**
this is correctly calculating mode based on groupby (car, model)
Can anyone help in this matter?
Upvotes: 1
Views: 1273
Reputation: 1261
My approach was to:
drive
feature for each car
/model
combo. drive
is null.However, turned out there were two key corner cases specific to OP's dataset that needed to be handled:
drive
column for this combo were NaN).Below are the steps I followed. If I begin with an example extended from first several rows of the sample dataframe in the question:
carsale = pd.DataFrame({'car': ['Ford', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Nissan', 'Honda','Renault', 'Mercedes-Benz', 'Mercedes-Benz', 'Toyota', 'Toyota', 'Ferrari'],
'price': [15500.000, 20500.000, 35000.000, 17800.000, 33000.000, 16600.000, 6500.000, 10500.000, 21500.000, 21500.000, 1280.000, 2005.00, 300000.000],
'body': ['crossover', 'sedan', 'other', 'van', 'vagon', 'crossover', 'sedan', 'vagon', 'sedan', 'sedan', 'compact', 'compact', 'sport'],
'mileage': [68.0, 173.0, 135.0, 162.0, 91.0, 83.0, 199.0, 185.0, 146.0, 146.0, 200.0, 134, 123.0],
'engType': ['Gas', 'Gas', 'Petrol', 'Diesel', np.nan, 'Petrol', 'Petrol', 'Diesel', 'Gas', 'Gas', 'Hybrid', 'Gas', 'Gas'],
'registration':['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
'year': [2010, 2011, 2008, 2012, 2013, 2013, 2003, 2011, 2012, 2012, 2009, 2003, 1988],
'model': ['Kuga', 'E-Class', 'CL 550', 'B 180', 'E-Class', 'X-Trail', 'Accord', 'Megane', 'E-Class', 'E-Class', 'Prius', 'Corolla', 'Testarossa'],
'drive': ['full', 'rear', 'rear', 'front', np.nan, 'full', 'front', 'front', 'rear', np.nan, np.nan, 'front', np.nan],
})
carsale
car price body mileage engType registration year model drive
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN
Create a dataframe to that shows the mode of the drive
feature for each car
/model
combination.
If a car/model combo has no mode (such as the row with Toyota Prius), I fill with the mode of that particular car brand (Toyota).
However, if the car brand, itself, (such as Ferrari here in my example) has no mode, I fill with the dataset's mode for the drive
feature.
def get_drive_mode(x):
brand = x.name[0]
if x.count() > 0:
return x.mode() # Return mode for a brand/model if the mode exists.
elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
return brand_mode # Return mode of brand if particular brand/model combo has no mode,
else: # but brand itself has a mode for the 'drive' feature.
return carsale['drive'].mode() # Otherwise return dataset's mode for the 'drive' feature.
drive_modes = carsale.groupby(['car','model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes
car model drive_mode
0 Ferrari Testarossa front
1 Ford Kuga full
2 Honda Accord front
3 Mercedes-Benz B 180 front
4 Mercedes-Benz CL 550 rear
5 Mercedes-Benz E-Class rear
6 Nissan X-Trail full
7 Renault Megane front
8 Toyota Corolla front
9 Toyota Prius front
drive
mode value for a given car/model in a given row if that row's value for drive
is NaN:def fill_with_mode(x):
if pd.isnull(x['drive']):
return drive_modes[(drive_modes['car'] == x['car']) & (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
else:
return x['drive']
carsale
dataframe in order to create the driveT
feature:carsale['driveT'] = carsale.apply(fill_with_mode, axis=1)
del(drive_modes)
Which results in the following dataframe:
carsale
car price body mileage engType registration year model drive driveT
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN rear
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN rear
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN front
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN front
Notice that in rows 4 and 9 of the driveT
column, the NaN value that was in the drive
column has been replaced by the string rear
, which as we would expect, is the mode of drive
for a Mercedes E-Class.
Also, in row 11, since there is no mode for the Toyota Prius car/model combo, we fill with the mode for the Toyota brand, which is front
.
Finally, in row 12, since there is no mode for the Ferrari car brand, we fill with the mode of the entire dataset's drive
column, which is also front
.
Upvotes: 1