Orchid9
Orchid9

Reputation: 37

Clean up city names in python

in a dataframe, there is a 'City' column containing different city names as well as various formats/misspellings. Here are the unique values in the column:

array(['somerville', 'hyde park', 'lexington', 'brookline', 'wellesley',
       'dover ', 'newton', 'westford', 'boston', 'needham', 'arlington',
       'wayland', 'waltham', 'cambridge', 'chestnuthill', 'salisbury ',
       'waban', 'weston', 'neeham', 'auburndale', 'belmont', 'allston',
       'auberdale', 'bedford', 'dover', 'lawrence', 'wilmington',
       'woburn', 'braintree', 'acton', 'winchester', 'middleton',
       'west newton', 'watertown', 'newton center', 'northfield',
       'roslindale', 'westwood', 'groton', 'natick', 'concord',
       'chestnut hill', 'westborough', 'sudbury', 'sherborn', 'quincy',
       'burlington', 'andover', 'littleton', 'stoughton'], dtype=object)

I want to clean up only four cities names using mapping, and leave other city names unchanged.

I used this code below:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

cities_map = {y:x[0] for x in cities_tup for y in x[1]}

df_MA.City = df_MA.City.map(cities_map)
df_MA.City.unique()

But the output is : array([nan, 'Dover', 'Newton', 'Needham'], dtype=object)

So basically, it changed all other city names into nan which is not I want. Are there any other methods or packages to clean up city names?

Upvotes: 2

Views: 453

Answers (2)

mozway
mozway

Reputation: 262234

Use replace:

df_MA['City'] = df_MA['City'].replace(cities_map)

Or combine map and fillna:

df_MA['City'] = df_MA['City'].map(cities_map).fillna(df_MA['City'])

NB. Don't assign to df_MA.City, always use the square brackets notation: df_MA['City'].

Upvotes: 2

Saxtheowl
Saxtheowl

Reputation: 4662

Your code is only mapping the four specified cities so the others became NaN

I modified your code to use a loop and statement to check if a city needs to be mapped:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

for city_name, variants in cities_names:
    for variant in variants:
        df_MA.loc[df_MA.City.str.lower() == variant, 'City'] = city_name

df_MA.City.unique()

Upvotes: 0

Related Questions