Handle missing values based on other column values

Question

I have a data frame df:

import pandas as pd

df = pd.DataFrame({'City': ['Cambridge','','Boston','Washington','','Tampa',
'Danvers','Miami','Cambridge','Miami','','Washington'], 'State': ['MA','DC','MA',
'DC','MA','FL','MA','FL','MA','FL','FL','DC']})

As shown above, df has two columns, "City" and "State". Three rows have an empty string ('') as the city. I want to fill each of those missing values with the city that occurs most often for the same state. For example, the second missing city belongs to the state MA. "Cambridge" is the most frequent city for MA, so that missing value should be replaced with "Cambridge".

By the same rule, the first missing city should be Washington, the second Cambridge, and the third Miami.

How can I accomplish this using pandas?

Alex · Accepted Answer

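import numpy as np

# Most frequent non-empty city for each state.
# Ties are resolved arbitrarily, because set iteration order is not fixed.
top_cities = {}
for state in np.unique(df.State):
    cities = [city for city in df[df.State == state].City.values if city]
    top_cities[state] = max(set(cities), key=cities.count)

# Keep existing cities; fill empty ones with the top city for that state.
new_cities = []
for city, state in df[['City', 'State']].values:
    if city:
        new_cities.append(city)
    else:
        new_cities.append(top_cities[state])

df['City'] = new_cities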
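
The same idea can probably be written without explicit Python loops. What follows is a minimal sketch, not a definitive implementation. It assumes missing cities are always the empty string '' and that every state has at least one known city (otherwise the lambda raises an IndexError). The names city and top_city are illustrative. The sample frame repeats the question's data with 'Washington' spelled the same way in both DC rows.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Cambridge', '', 'Boston', 'Washington', '', 'Tampa',
             'Danvers', 'Miami', 'Cambridge', 'Miami', '', 'Washington'],
    'State': ['MA', 'DC', 'MA', 'DC', 'MA', 'FL',
              'MA', 'FL', 'MA', 'FL', 'FL', 'DC'],
})

# Treat empty strings as missing (NaN) so they are not counted as a city.
city = df['City'].mask(df['City'] == '')

# Most frequent known city per state. mode() ignores NaN and returns ties
# in sorted order, so .iloc[0] picks the alphabetically first city on a tie.
# Assumption: every state has at least one known city.
top_city = city.groupby(df['State']).agg(lambda s: s.mode().iloc[0])

# Fill only the missing entries, looking up the top city for each row's state.
df['City'] = city.fillna(df['State'].map(top_city))
```

Unlike max(set(cities), key=cities.count), this variant breaks ties deterministically, which may or may not be the behavior you want.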
