user1993617
user1993617

Reputation: 53

Filling missing values of categorical values based on other categorical values in pandas dataframe

I want to fill missing values of categorical values in Pandas data frame with the most frequent values on another category. For example,

import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink',    'juice','juice','juice'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan], 
    'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)
df

which results in

      price     | product   |   type    
0   25          |  coca     | softdrink   
1   94          |   NaN     | juice    
2   57          |   pepsi   | softdrink    
3   62          |   pepsi   | softdrink    
4   70          |   orange  | juice    
5   50          |    grape  | juice    
6   60          |   NaN     | softdrink    

First, I use groupby as

df.groupby('type')['product'].value_counts()   

to get

type      |   product    
juice     |    grape  |   1    
          |   orange  |   1    
softdrink | pepsi     |   2    
          | coca      |   1    
Name: product, dtype: int64    

I want to fill a missing product of second row with "pepsi" (the most infrequence) but filling "grape" for missing value of row 6 of category "juice". Without categorical group, my solution is to find most frequent value by the column and assign this value to missing value.

df['product'].fillna(df['product'].value_counts().index[0],inplace=True)

I struggle to complete the task since the return value of the command

df.groupby('type')['product'].value_counts()

is pandas series which can be accessed by

df.groupby('type')['product'].value_counts()['softdrink']['pepsi']

how I know which product+category has the most frequence.

Upvotes: 3

Views: 7487

Answers (1)

BENY
BENY

Reputation: 323226

IIUC

Using mode


Data input

import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink',    'juice','juice','softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan],
    'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)

solution

df.groupby('type').product.transform(lambda x: x.fillna(x.mode()[0]))

Out[28]: 
0      coca
1     grape
2     pepsi
3     pepsi
4    orange
5     grape
6     pepsi
Name: product, dtype: object

New df

df['product']=df.groupby('type').product.transform(lambda x: x.fillna(x.mode()[0]))
df
Out[40]: 
   price product       type
0     25    coca  softdrink
1     94   grape      juice
2     57   pepsi  softdrink
3     62   pepsi  softdrink
4     70  orange      juice
5     50   grape      juice
6     60   pepsi  softdrink

Upvotes: 3

Related Questions