Wouter
Wouter

Reputation: 1336

How to use pandas to fill missing data on specific categories of data within the dataframe?

How to use pandas to fill missing data on specific categories of data within the dataframe?

I have used fillna(method='ffill') after selecting a category but this creates NaN values on the other categories. Is there a better way?

I use this, i.e. I select the category 'Buurt' and apply the ffill (to insert missing zipcodes). But the rows with another category ('Wijk') will become NaN.

df['Zipcode'] = df.loc[(df['RegionType'] == 'Buurt'), 'Zipcode'].fillna(method='ffill')

df before and after:

           City                              Zipcode RegionType
Codering                                                   
BU06100305   Sliedrecht                     3362      Buurt
BU06100306   Sliedrecht                     3361      Buurt
BU06100307   Sliedrecht                     3361      Buurt
WK061004     Sliedrecht                        .       Wijk
BU06100401   Sliedrecht                      NaN      Buurt
BU06100402   Sliedrecht                      NaN      Buurt
BU06100403   Sliedrecht                     3364      Buurt
BU06100404   Sliedrecht                     3364      Buurt
BU06100405   Sliedrecht                     3364      Buurt
BU06100406   Sliedrecht                     3364      Buurt
           City                              Zipcode RegionType
Codering                                                   
BU06100305   Sliedrecht                     3362      Buurt
BU06100306   Sliedrecht                     3361      Buurt
BU06100307   Sliedrecht                     3361      Buurt
WK061004     Sliedrecht                      NaN       Wijk
BU06100401   Sliedrecht                     3361      Buurt
BU06100402   Sliedrecht                     3361      Buurt
BU06100403   Sliedrecht                     3364      Buurt
BU06100404   Sliedrecht                     3364      Buurt
BU06100405   Sliedrecht                     3364      Buurt
BU06100406   Sliedrecht                     3364      Buurt

I could change the NaNs back to a dot (.) but feel there must be a better way. Especially because the rows for 'Buurt' were actually filled with a dot (.) and I changed these to NaN to do the fillna(method='ffill') in the first place...

serieBuurtNoZipcode = (df['RegioType'] == 'Buurt') & (df['Zipcode'] == '.') 
df.loc[(serieBuurtNoZipcode), 'Zipcode'] = np.nan

How can I do this in a smarter way? Many thanks in advance!

Upvotes: 2

Views: 892

Answers (1)

jezrael
jezrael

Reputation: 863731

I think you need assign to filtered column Zipcode too:

mask = (df['RegionType'] == 'Buurt')
df.loc[mask, 'Zipcode'] = df.loc[mask, 'Zipcode'].ffill()
print (df)
                  City Zipcode RegionType
Codering                                 
BU06100305  Sliedrecht    3362      Buurt
BU06100306  Sliedrecht    3361      Buurt
BU06100307  Sliedrecht    3361      Buurt
WK061004    Sliedrecht       .       Wijk
BU06100401  Sliedrecht    3361      Buurt
BU06100402  Sliedrecht    3361      Buurt
BU06100403  Sliedrecht    3364      Buurt
BU06100404  Sliedrecht    3364      Buurt
BU06100405  Sliedrecht    3364      Buurt
BU06100406  Sliedrecht    3364      Buurt

But if want replace all categories by ffill use groupby:

df['Zipcode'] = df.groupby('RegionType')['Zipcode'].ffill()

More general:

df['Zipcode'] = df.groupby('RegionType')['Zipcode'].apply(lambda x: x.ffill().bfill())

Upvotes: 2

Related Questions