ComplexData
ComplexData

Reputation: 1103

Handle missing values based on other column values

I have a data frame df:

df = pd.DataFrame({'City': ['Cambridge','','Boston','Washignton','','Tampa',
'Danvers','Miami','Cambridge','Miami','','Washington'], 'State': ['MA','DC','MA',
'DC','MA','FL','MA','FL','MA','FL','FL','DC']})

As we can see in the above df, I have two columns "City" and "State". There are 3 cities with '' (No values). I want to assign a value to those missing values in cities. The assignment has to be done in the following way - City which exists the max number of times for a particular state should be assigned to the missing value for that particular state. For example: The 2nd missing city is corresponding to the state MA. Now if I carefully look at the data, "Cambridge" is the city which occurs the most number of times for the state MA. Therefore, that missing value should be replaced with "Cambridge".

Following the same trend, 1st missing city should be Washington, 2nd should be Cambridge and 3rd should be Miami.

How will I accomplish this task using pandas?

Upvotes: 3

Views: 1109

Answers (2)

piRSquared
piRSquared

Reputation: 294536

IIUC

df.update(df.replace('', np.nan).set_index('State', append=True) \
    .groupby(level='State').City \
    .apply(lambda x: x.fillna(x.value_counts().idxmax())) \
    .reset_index('State', drop=True))
df

enter image description here


reference code
note: additional column A is original City column making it easier to see where cities get filled in.

df = pd.DataFrame({
        'City': ['Cambridge','','Boston','Washignton','','Tampa',
                 'Danvers','Miami','Cambridge','Miami','','Washington'],
        'State': ['MA','DC','MA','DC','MA','FL',
                  'MA','FL','MA','FL','FL','DC'],
        'A': ['Cambridge','','Boston','Washignton','','Tampa',
              'Danvers','Miami','Cambridge','Miami','','Washington']})

df.update(df.replace('', np.nan).set_index('State', append=True) \
    .groupby(level='State').City \
    .apply(lambda x: x.fillna(x.value_counts().idxmax())) \
    .reset_index('State', drop=True))
df

enter image description here

Upvotes: 3

Alex
Alex

Reputation: 12943

top_cities = {}
for state in np.unique(df.State):
    cities = [city for city in df[df.State==state].City.values if city]
    top_cities[state] = max(set(cities), key=cities.count)

new_cities = []
for city, state in df.values:
    if city:
        new_cities.append(city)
    else:
        new_cities.append(top_cities[state])

df['City'] = new_cities

enter image description here

Upvotes: 0

Related Questions