324

Reputation: 724

How can I standardize names in Python using a dictionary map?

I have a lot of survey results data, and one column asked which state the respondent was from. The answers are inconsistent: for example, some people wrote "VA" and others wrote "Virginia".

I was hoping to use a dictionary map, but things weren't working out so well. Does anyone have any suggestions for me? I am relatively new to Python, so I'm still trying to get the hang of things.

Here's what I've tried:

abv = {"Virginia": "VA", "Maryland": "MD",
      "West Virginia": "WV", "Pennsylvania": "PA"}
abv2 = dict(map(reversed, abv.items()))
survey['New State'] = survey.State.map(abv2)
survey

Some people typed "Virginia" and others wrote "VA". I only want the abbreviation version.

Upvotes: 0

Views: 1249

Answers (2)

olinox14

Reputation: 6663

If you really cannot validate the user input on the front end, you can use the dictionary's get method and pass the input itself as the default, so that anything not found in the mapping (such as an existing abbreviation) is returned unchanged:

def fix(user_input):
    mapping = {"Virginia": "VA", "Maryland": "MD",
               "West Virginia": "WV", "Pennsylvania": "PA"}

    return mapping.get(user_input, user_input)

print(fix("Virginia"))  # >> VA
print(fix("VA"))  # >> VA
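
Free-text answers often differ in whitespace or letter case as well. Here is a minimal sketch of the same get-with-fallback idea with a normalization step added. The name fix_normalized and the lower-cased keys are my own choices for illustration, not part of the answer above, and the sketch assumes the only variations are surrounding whitespace and capitalization.

```python
# Keys are lower-cased so the lookup is case-insensitive (an assumption
# made for this sketch; the original mapping used title-cased keys).
MAPPING = {"virginia": "VA", "maryland": "MD",
           "west virginia": "WV", "pennsylvania": "PA"}


def fix_normalized(user_input):
    """Return the abbreviation for a known state name, else the cleaned input."""
    cleaned = user_input.strip()
    # Look up the lower-cased text; fall back to the cleaned input unchanged.
    return MAPPING.get(cleaned.lower(), cleaned)


answers = [" virginia", "VA", "West Virginia ", "Texas"]
fixed = [fix_normalized(a) for a in answers]
print(fixed)  # ['VA', 'VA', 'WV', 'Texas']
```

Since Series.map also accepts a function, something like survey["State"].map(fix_normalized) should apply this to the whole column.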

Upvotes: 0

Brad Solomon

Reputation: 40918

Let's say your DataFrame looks like this:

>>> import pandas as pd
>>> survey = pd.DataFrame(
...     ["Virginia", "VA", "VA", "Penns.", "PA", "Pennsylvania"],
...     columns=["State"]
... )
>>> survey
          State
0      Virginia
1            VA
2            VA
3        Penns.
4            PA
5  Pennsylvania

The initial mapping you construct can be a mapping of longer-form names to the canonical abbreviations.

>>> to_abbrev = { 
...     "Virginia": "VA", 
...     "Pennsylvania": "PA", 
...     "Penns.": "PA", 
... }

Then, update that with the abbreviations themselves:

>>> to_abbrev.update({v: v for v in to_abbrev.values()})
>>> to_abbrev
{'Virginia': 'VA',
 'Pennsylvania': 'PA',
 'Penns.': 'PA',
 'VA': 'VA',
 'PA': 'PA'}

Finally, call .map() to get the result:

>>> survey["State"].map(to_abbrev)
0    VA
1    VA
2    VA
3    PA
4    PA
5    PA
Name: State, dtype: object

It's worth stating the semi-obvious: your to_abbrev must cover every value that appears in the column; otherwise, any value missing from the mapping will become NaN:

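>>> # DataFrame.append was removed in pandas 2.0, so use pd.concat to add the row
>>> extra = pd.DataFrame({"State": ["Wisconsin"]})
>>> pd.concat([survey, extra], ignore_index=True)["State"].map(to_abbrev)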
0     VA
1     VA
2     VA
3     PA
4     PA
5     PA
6    NaN
Name: State, dtype: object
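
If a complete mapping isn't practical, one option is to find the values that failed to map and then decide what to do with them. This is only a sketch: the sample data is made up for illustration, the "New State" column name is borrowed from the question, and keeping the original answer as a fallback is one possible policy, not necessarily the right one for your data.

```python
import pandas as pd

# Made-up sample data for illustration.
survey = pd.DataFrame({"State": ["Virginia", "VA", "Penns.", "Wisconsin"]})
to_abbrev = {"Virginia": "VA", "Pennsylvania": "PA", "Penns.": "PA",
             "VA": "VA", "PA": "PA"}

mapped = survey["State"].map(to_abbrev)

# Rows with no match are NaN; collect them so the mapping can be extended.
unmapped = survey.loc[mapped.isna(), "State"].unique().tolist()
print(unmapped)  # ['Wisconsin']

# Alternatively, keep the original answer wherever no match was found.
survey["New State"] = mapped.fillna(survey["State"])
print(survey)
```

Reviewing the unmapped list first is usually safer than falling back silently, since it shows you which spellings the mapping still needs.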

As suggested in the comments, there are undoubtedly libraries out there designed to build this mapping for you more holistically, taking into account things like common typos and small differences in punctuation, such as "D.C." versus "DC".
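
For simple typos you may not need a third-party library at all. Below is a rough sketch using difflib.get_close_matches from the standard library. This is a swapped-in technique, not one of the libraries mentioned in the comments. The helper name fuzzy_abbrev and the 0.8 cutoff are assumptions I picked for illustration, and the cutoff would need tuning against real data.

```python
from difflib import get_close_matches

to_abbrev = {"Virginia": "VA", "Maryland": "MD",
             "West Virginia": "WV", "Pennsylvania": "PA"}


def fuzzy_abbrev(answer, cutoff=0.8):
    """Return the abbreviation for the closest known state name, or None."""
    cleaned = answer.strip()
    # Answers that are already a known abbreviation pass straight through.
    if cleaned.upper() in to_abbrev.values():
        return cleaned.upper()
    # Compare against the full names; keep only the single best match
    # whose similarity ratio is at least `cutoff`.
    matches = get_close_matches(cleaned.title(), list(to_abbrev),
                                n=1, cutoff=cutoff)
    return to_abbrev[matches[0]] if matches else None


print(fuzzy_abbrev("Virgina"))  # VA
print(fuzzy_abbrev("va"))       # VA
print(fuzzy_abbrev("Texas"))    # None
```

Fuzzy matching can produce wrong matches for short or similar names, so it is worth spot-checking the results before overwriting the column.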

Upvotes: 1
