Rohit Negi
Rohit Negi

Reputation: 109

Why fillna not working as expected for mode

I am working on a carsale data set having columns : 'car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration','year', 'model', 'drive'

column 'drive' and 'engType' have NaN missing values, I want to calculate mode for let say for 'drive' based on group by of ['car', 'model'] and then where this group falls, I want to replace NaN value there based on this groupby

I have tried these methods:

both are giving same results

enter image description here



# carsale['price2'] = (carsale.groupby(['car','model','year']))['price'].transform(lambda x: x.fillna(x.median()))

# carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
# carsale['mileage2'] = (carsale.groupby(['car','model','year']))['mileage'].transform(lambda x: x.fillna(x.median()))
# mode = carsale.filter(['car','drive']).mode()
# carsale[['test1','test2']] = carsale[['car','engType']].fillna(carsale.mode().iloc[0])

**carsale.groupby(['car', 'model'])['engType'].apply(pd.Series.mode)**

# carsale.apply()
# carsale
# carsale['engType2'] = carsale.groupby('car').engType.transform(lambda x: x.fillna(x.mode()))

**carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
        ].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
        ].transform(lambda x: x.fillna(pd.Series.mode(x)))**

# carsale[carsale.car == 'Mercedes-Benz'].sort_values(['body','engType','model','mileage']).tail(50)
# carsale[carsale.engV.isnull()]
# carsale.sort_values(['car','body','model'])

**carsale**

from above both methods giving the same results, it is just replacing/adding values in new column driveT same as we have in origional column 'drive'. like if we have NaN in some indexes then it is showing same NaN in driveT as well and same for other values.

But for numerical data, if i applied median it is adding/replacing correct value.

So the thing is it actually not calculating mode based on ['car', 'model'] group instead it is doing mode for single values in 'drive', but if you run this command

**carsale.groupby(['car','model'])['engType'].apply(pd.Series.mode)**

this is correctly calculating mode based on groupby (car, model)

enter image description here

Can anyone help in this matter?

Upvotes: 1

Views: 1273

Answers (1)

James Dellinger
James Dellinger

Reputation: 1261

My approach was to:

  1. Use .groupby() to create a look-up dataframe that contains the mode of the drive feature for each car/model combo.
  2. Write a method that looks up the mode in this dataframe and returns it for a given car/model, when that car/model's value in drive is null.

However, turned out there were two key corner cases specific to OP's dataset that needed to be handled:

  • When a particular car/model combo has no mode (because all entries in the drive column for this combo were NaN).
  • When a particular car brand has no mode.

Below are the steps I followed. If I begin with an example extended from first several rows of the sample dataframe in the question:

carsale = pd.DataFrame({'car': ['Ford', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Nissan', 'Honda','Renault', 'Mercedes-Benz', 'Mercedes-Benz', 'Toyota', 'Toyota', 'Ferrari'],
                   'price': [15500.000, 20500.000, 35000.000, 17800.000, 33000.000, 16600.000, 6500.000, 10500.000, 21500.000, 21500.000, 1280.000, 2005.00, 300000.000],
                   'body': ['crossover', 'sedan', 'other', 'van', 'vagon', 'crossover', 'sedan', 'vagon', 'sedan', 'sedan', 'compact', 'compact', 'sport'],
                   'mileage': [68.0, 173.0, 135.0, 162.0, 91.0, 83.0, 199.0, 185.0, 146.0, 146.0, 200.0, 134, 123.0],
                   'engType': ['Gas', 'Gas', 'Petrol', 'Diesel', np.nan, 'Petrol', 'Petrol', 'Diesel', 'Gas', 'Gas', 'Hybrid', 'Gas', 'Gas'],
                   'registration':['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'year': [2010, 2011, 2008, 2012, 2013, 2013, 2003, 2011, 2012, 2012, 2009, 2003, 1988],
                   'model': ['Kuga', 'E-Class', 'CL 550', 'B 180', 'E-Class', 'X-Trail', 'Accord', 'Megane', 'E-Class', 'E-Class', 'Prius', 'Corolla', 'Testarossa'],
                   'drive': ['full', 'rear', 'rear', 'front', np.nan, 'full', 'front', 'front', 'rear', np.nan, np.nan, 'front', np.nan],
                  })
carsale

    car               price  body       mileage   engType   registration  year  model       drive
0   Ford            15500.0  crossover     68.0   Gas       yes           2010  Kuga        full
1   Mercedes-Benz   20500.0  sedan        173.0   Gas       yes           2011  E-Class     rear
2   Mercedes-Benz   35000.0  other        135.0   Petrol    yes           2008  CL 550      rear
3   Mercedes-Benz   17800.0  van          162.0   Diesel    yes           2012  B 180       front
4   Mercedes-Benz   33000.0  vagon         91.0   NaN       yes           2013  E-Class     NaN
5   Nissan          16600.0  crossover     83.0   Petrol    yes           2013  X-Trail     full
6   Honda            6500.0  sedan        199.0   Petrol    yes           2003  Accord      front
7   Renault         10500.0  vagon        185.0   Diesel    yes           2011  Megane      front
8   Mercedes-Benz   21500.0  sedan        146.0   Gas       yes           2012  E-Class     rear
9   Mercedes-Benz   21500.0  sedan        146.0   Gas       yes           2012  E-Class     NaN
10  Toyota           1280.0  compact      200.0   Hybrid    yes           2009  Prius       NaN
11  Toyota           2005.0  compact      134.0   Gas       yes           2003  Corolla     front
12  Ferrari        300000.0  sport        123.0   Gas       yes           1988  Testarossa  NaN
  1. Create a dataframe to that shows the mode of the drive feature for each car/model combination.

    If a car/model combo has no mode (such as the row with Toyota Prius), I fill with the mode of that particular car brand (Toyota).

    However, if the car brand, itself, (such as Ferrari here in my example) has no mode, I fill with the dataset's mode for the drive feature.

def get_drive_mode(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.mode() # Return mode for a brand/model if the mode exists.
    elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
        brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
        return brand_mode # Return mode of brand if particular brand/model combo has no mode,
    else:                 # but brand itself has a mode for the 'drive' feature. 
        return carsale['drive'].mode() # Otherwise return dataset's mode for the 'drive' feature.

drive_modes = carsale.groupby(['car','model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes

    car             model        drive_mode
0   Ferrari         Testarossa   front
1   Ford            Kuga         full
2   Honda           Accord       front
3   Mercedes-Benz   B 180        front
4   Mercedes-Benz   CL 550       rear
5   Mercedes-Benz   E-Class      rear
6   Nissan          X-Trail      full
7   Renault         Megane       front
8   Toyota          Corolla      front
9   Toyota          Prius        front
  1. Write a method that looks up the drive mode value for a given car/model in a given row if that row's value for drive is NaN:
def fill_with_mode(x):
    if pd.isnull(x['drive']):
        return drive_modes[(drive_modes['car'] == x['car']) & (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
    else:
        return x['drive']
  1. Apply the above method to the rows in the carsale dataframe in order to create the driveT feature:
carsale['driveT'] = carsale.apply(fill_with_mode, axis=1)
del(drive_modes)

Which results in the following dataframe:

carsale

    car               price  body       mileage   engType   registration  year  model       drive   driveT
0   Ford            15500.0  crossover     68.0   Gas       yes           2010  Kuga        full    full
1   Mercedes-Benz   20500.0  sedan        173.0   Gas       yes           2011  E-Class     rear    rear
2   Mercedes-Benz   35000.0  other        135.0   Petrol    yes           2008  CL 550      rear    rear
3   Mercedes-Benz   17800.0  van          162.0   Diesel    yes           2012  B 180       front   front
4   Mercedes-Benz   33000.0  vagon         91.0   NaN       yes           2013  E-Class     NaN     rear
5   Nissan          16600.0  crossover     83.0   Petrol    yes           2013  X-Trail     full    full
6   Honda            6500.0  sedan        199.0   Petrol    yes           2003  Accord      front   front
7   Renault         10500.0  vagon        185.0   Diesel    yes           2011  Megane      front   front
8   Mercedes-Benz   21500.0  sedan        146.0   Gas       yes           2012  E-Class     rear    rear
9   Mercedes-Benz   21500.0  sedan        146.0   Gas       yes           2012  E-Class     NaN     rear
10  Toyota           1280.0  compact      200.0   Hybrid    yes           2009  Prius       NaN     front
11  Toyota           2005.0  compact      134.0   Gas       yes           2003  Corolla     front   front
12  Ferrari        300000.0  sport        123.0   Gas       yes           1988  Testarossa  NaN     front

Notice that in rows 4 and 9 of the driveT column, the NaN value that was in the drive column has been replaced by the string rear, which as we would expect, is the mode of drive for a Mercedes E-Class.

Also, in row 11, since there is no mode for the Toyota Prius car/model combo, we fill with the mode for the Toyota brand, which is front.

Finally, in row 12, since there is no mode for the Ferrari car brand, we fill with the mode of the entire dataset's drive column, which is also front.

Upvotes: 1

Related Questions