Reputation: 1103
I have a data frame df:
df = pd.DataFrame({'City': ['Cambridge','','Boston','Washignton','','Tampa',
'Danvers','Miami','Cambridge','Miami','','Washington'], 'State': ['MA','DC','MA',
'DC','MA','FL','MA','FL','MA','FL','FL','DC']})
As we can see in the above df, I have two columns "City" and "State". There are 3 cities with '' (No values). I want to assign a value to those missing values in cities. The assignment has to be done in the following way - City which exists the max number of times for a particular state should be assigned to the missing value for that particular state. For example: The 2nd missing city is corresponding to the state MA. Now if I carefully look at the data, "Cambridge" is the city which occurs the most number of times for the state MA. Therefore, that missing value should be replaced with "Cambridge".
Following the same trend, 1st missing city should be Washington, 2nd should be Cambridge and 3rd should be Miami.
How will I accomplish this task using pandas?
Upvotes: 3
Views: 1109
Reputation: 294536
IIUC
df.update(df.replace('', np.nan).set_index('State', append=True) \
.groupby(level='State').City \
.apply(lambda x: x.fillna(x.value_counts().idxmax())) \
.reset_index('State', drop=True))
df
reference code
note: additional column A
is original City
column making it easier to see where cities get filled in.
df = pd.DataFrame({
'City': ['Cambridge','','Boston','Washignton','','Tampa',
'Danvers','Miami','Cambridge','Miami','','Washington'],
'State': ['MA','DC','MA','DC','MA','FL',
'MA','FL','MA','FL','FL','DC'],
'A': ['Cambridge','','Boston','Washignton','','Tampa',
'Danvers','Miami','Cambridge','Miami','','Washington']})
df.update(df.replace('', np.nan).set_index('State', append=True) \
.groupby(level='State').City \
.apply(lambda x: x.fillna(x.value_counts().idxmax())) \
.reset_index('State', drop=True))
df
Upvotes: 3
Reputation: 12943
top_cities = {}
for state in np.unique(df.State):
cities = [city for city in df[df.State==state].City.values if city]
top_cities[state] = max(set(cities), key=cities.count)
new_cities = []
for city, state in df.values:
if city:
new_cities.append(city)
else:
new_cities.append(top_cities[state])
df['City'] = new_cities
Upvotes: 0