How can I standardize names in Python using a dictionary map?

Question

I have a LOT of survey results data, and one column asks which state the respondent is from. The answers are inconsistent: for example, some people wrote "VA" and others wrote "Virginia".

I was hoping to use a dictionary map, but things weren't working out so well. Does anyone have any suggestions for me? I am relatively new to Python, so I'm still trying to get the hang of things.

Here's what I've tried:

abv = {"Virginia": "VA", "Maryland": "MD",
      "West Virginia": "WV", "Pennsylvania": "PA"}
abv2 = dict(map(reversed, abv.items()))
survey['New State'] = survey.State.map(abv2)
survey

Some people typed "Virginia" and others wrote "VA". I only want the abbreviation version.

Brad Solomon · Accepted Answer

Let's say your DataFrame looks like this:

>>> import pandas as pd
>>> survey = pd.DataFrame(
...     ["Virginia", "VA", "VA", "Penns.", "PA", "Pennsylvania"],
...     columns=["State"]
... )
>>> survey
          State
0      Virginia
1            VA
2            VA
3        Penns.
4            PA
5  Pennsylvania

The initial mapping you construct can be a mapping of longer-form names to the canonical abbreviations.

>>> to_abbrev = { 
...     "Virginia": "VA", 
...     "Pennsylvania": "PA", 
...     "Penns.": "PA", 
... }

Then, update that with the abbreviations themselves:

>>> to_abbrev.update({v: v for v in to_abbrev.values()})
>>> to_abbrev
{'Virginia': 'VA',
 'Pennsylvania': 'PA',
 'Penns.': 'PA',
 'VA': 'VA',
 'PA': 'PA'}

Finally, call .map() to get the result:

>>> survey["State"].map(to_abbrev)
0    VA
1    VA
2    VA
3    PA
4    PA
5    PA
Name: State, dtype: object
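For contrast, here is a minimal sketch of what the reversed dictionary from the question does, next to the approach above. The four-row `survey` frame is made up for illustration; the `abv` dictionary and the `New State` column name come from the question.

```python
import pandas as pd

# Illustrative data only; the real survey column will be much longer.
survey = pd.DataFrame(["Virginia", "VA", "Pennsylvania", "PA"], columns=["State"])

abv = {"Virginia": "VA", "Maryland": "MD",
       "West Virginia": "WV", "Pennsylvania": "PA"}

# The question's attempt: reversing gives abbreviation -> full name,
# so "VA" becomes "Virginia" and "Virginia" has no key at all (NaN).
abv2 = dict(map(reversed, abv.items()))
reversed_result = survey["State"].map(abv2)

# The approach from this answer: keep full name -> abbreviation,
# then add each abbreviation as a key that maps to itself.
to_abbrev = dict(abv)
to_abbrev.update({v: v for v in abv.values()})
survey["New State"] = survey["State"].map(to_abbrev)

print(reversed_result)
print(survey)
```

So the dictionary itself was fine; it just needs to point toward the abbreviation and also cover the rows that are already abbreviated.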

It's worth stating the semi-obvious: your to_abbrev must be a complete mapping; otherwise, any value that isn't a key in it will come back as NaN:

>>> extra = pd.DataFrame({"State": ["Wisconsin"]})
>>> pd.concat([survey, extra], ignore_index=True)["State"].map(to_abbrev)
0     VA
1     VA
2     VA
3     PA
4     PA
5     PA
6    NaN
Name: State, dtype: object
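Because gaps in the mapping silently turn into NaN, it can help to list the raw values that failed to map before trusting the result. The following is only a sketch under assumed sample data; the `unmapped` name and the fallback of keeping the original text are choices made for illustration, not something the question requires.

```python
import pandas as pd

# Illustrative data with one state that is missing from the mapping.
survey = pd.DataFrame(["Virginia", "VA", "Penns.", "Wisconsin"], columns=["State"])

to_abbrev = {"Virginia": "VA", "Pennsylvania": "PA", "Penns.": "PA"}
to_abbrev.update({v: v for v in to_abbrev.values()})

mapped = survey["State"].map(to_abbrev)

# Raw answers with no key in the mapping. Rows that were already
# missing in the source column would also show up here.
unmapped = survey.loc[mapped.isna(), "State"].unique().tolist()
print(unmapped)

# Optional fallback: keep the original text where no mapping exists,
# so nothing is lost while the dictionary is still being filled in.
survey["New State"] = mapped.fillna(survey["State"])
print(survey)
```

Each entry in `unmapped` is a candidate for a new key in `to_abbrev`.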

As suggested in the comments, there are undoubtedly libraries out there designed to build this mapping for you more holistically, accounting for things like common typos and small formatting differences, such as "D.C." versus "DC".
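Short of reaching for a library, some of that variation can be absorbed by normalizing both the dictionary keys and the survey answers before the lookup. This is a hand-rolled sketch, not a library API: the `normalize` helper, the `lookup` dictionary, and the sample rows are all hypothetical, and it assumes every answer is a string (no missing values).

```python
import pandas as pd

def normalize(name: str) -> str:
    """Hypothetical helper: trim whitespace, drop periods, ignore case."""
    return name.strip().replace(".", "").casefold()

to_abbrev = {
    "Virginia": "VA",
    "Pennsylvania": "PA",
    "Penns.": "PA",
    "District of Columbia": "DC",
}
to_abbrev.update({v: v for v in to_abbrev.values()})

# Re-key the mapping by the normalized form of each name.
lookup = {normalize(key): abbrev for key, abbrev in to_abbrev.items()}

# Illustrative answers with stray spaces, mixed case, and periods.
survey = pd.DataFrame([" virginia", "va", "D.C.", "DC", "PENNS"], columns=["State"])

# Normalize each answer the same way, then look it up.
survey["New State"] = survey["State"].map(normalize).map(lookup)
print(survey)
```

This only handles spacing, case, and periods. Real typos such as "Virgina" would still come back as NaN and need either more keys or fuzzy matching.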
