chaikov
chaikov

Reputation: 505

formatting multiple city names into universal name for each city all at once using pandas

change all city name into one universal name.

        City    b   c
0   New york    1   1
1   New York    2   2
2   N.Y.        3   3
3   NY          4   4

They call refer to the city New york however python sees them as separate entity therefore I've changed all into one.

df["City"] = df["City"].replace({"N.Y.":"New york", "New York": "New york", "NY": "New york"})

After this I need to check if all variation of new york is covered, to do that I've created a function

def universal_ok(universal_name):
    count = 0
    for c in df.City:
        if c == universal_name:
            count += 1

    # This only works when column consists of only one type of city
    if count == len(df.City):
        return "Yes all names are formatted correctly"
    else:
        return f"there are {len(df.City) - count} names that need to be changed"

universal_ok("New york")

but the problem is what about when there are more than one city in a column

    City        b   c
0   New york    1   1
1   New York    2   2
2   N.Y.        3   3
3   NY          4   4
4   Toronto     3   2
5   TO          3   2
6   toronto     3   2

is there a way to change each city to universal name?

Upvotes: 0

Views: 742

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62513

Convert to Lower, Unique Values, Map and Count:

Data:

     City  b  c
 New york  1  1
 New York  2  2
     N.Y.  3  3
       NY  4  4
  Toronto  3  2
       TO  3  2
  toronto  3  2

Convert to Lower:

df.City = df.City.str.lower()

     City  b  c
 new york  1  1
 new york  2  2
     n.y.  3  3
       ny  4  4
  toronto  3  2
       to  3  2
  toronto  3  2

Unique Values:

df.City.unique()

array(['new york', 'n.y.', 'ny', 'toronto', 'to'], dtype=object)

Mapping the City Names:

  • Use the unique values list, to map the values to the preferred form
  • I created a tuple, then used dict comprehension to create the dictionary
    • I did this, so I wouldn't have to repeatedly type the preferred city name, because I'm lazy / efficient, that way.
  • Tuples
  • Python Dictionary Comprehension Tutorial
  • pandas.Series.map
cities_tup = (('New York', ['ny', 'n.y.', 'new york']),
              ('Toronto', ['toronto', 'to']))

cities_map = {y:x[0] for x in cities_tup for y in x[1]}

{'ny': 'New York',
 'n.y.': 'New York',
 'new york': 'New York',
 'toronto': 'Toronto',
 'to': 'Toronto'}

df.City = df.City.map(cities_map)

     City  b  c
 New York  1  1
 New York  2  2
 New York  3  3
 New York  4  4
  Toronto  3  2
  Toronto  3  2
  Toronto  3  2

Unique Counts to verify:

df.City.value_counts()

New York    4
Toronto     3
Name: City, dtype: int64

Remarks

  • Undoubtedly, there are alternate methods to accomplish this task, but I think this is straightforward and easy to follow.
    • Someone will probably come along and offer a one-liner.

Upvotes: 1

mouviciel
mouviciel

Reputation: 67879

You need a specific column with some sort of city id, otherwise you won’t be able to distinguish between Paris, France and Paris, Texas, nor will you be able to group Istanbul and Constantinople.

Upvotes: 0

Related Questions