Reputation: 724
I have a LOT of survey results data, and one column asked which state the user was from. For example, some people wrote "VA" and others wrote "Virginia".
I was hoping to use a dictionary map, but things weren't working out so well. Does anyone have any suggestions for me? I am relatively new to Python, so I'm still trying to get the hang of things.
Here's what I've tried:
abv = {"Virginia": "VA", "Maryland": "MD",
       "West Virginia": "WV", "Pennsylvania": "PA"}
abv2 = dict(map(reversed, abv.items()))
survey['New State'] = survey.State.map(abv2)
survey
Some people typed "Virginia" and others wrote "VA". I only want the abbreviation version.
Upvotes: 0
Views: 1249
Reputation: 6663
If you really cannot validate the user input on the front end, you can use the dictionary's get method, passing the input itself as the fallback default:
def fix(user_input):
    mapping = {"Virginia": "VA", "Maryland": "MD",
               "West Virginia": "WV", "Pennsylvania": "PA"}
    return mapping.get(user_input, user_input)

print(fix("Virginia"))  # >> VA
print(fix("VA"))  # >> VA
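To apply the same fallback to a whole pandas column rather than one string at a time, the lookup can be passed to Series.map as a function. This is a minimal sketch, assuming the survey DataFrame and State column from the question; the sample rows are made up for illustration.

```python
import pandas as pd

mapping = {"Virginia": "VA", "Maryland": "MD",
           "West Virginia": "WV", "Pennsylvania": "PA"}

# Made-up sample rows standing in for the real survey data.
survey = pd.DataFrame({"State": ["Virginia", "VA", "Maryland", "Ohio"]})

# Full names are translated; anything not in the mapping passes through unchanged.
survey["New State"] = survey["State"].map(lambda s: mapping.get(s, s))
print(survey)
```

Note that unrecognized answers such as "Ohio" pass through untouched, so the result can still contain values that are not abbreviations.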
Upvotes: 0
Reputation: 40918
Let's say your DataFrame looks like this:
>>> import pandas as pd
>>> survey = pd.DataFrame(
...     ["Virginia", "VA", "VA", "Penns.", "PA", "Pennsylvania"],
...     columns=["State"]
... )
>>> survey
State
0 Virginia
1 VA
2 VA
3 Penns.
4 PA
5 Pennsylvania
The initial mapping you construct can be a mapping of longer-form names to the canonical abbreviations.
>>> to_abbrev = {
...     "Virginia": "VA",
...     "Pennsylvania": "PA",
...     "Penns.": "PA",
... }
Then, update that with the abbreviations themselves:
>>> to_abbrev.update({v: v for v in to_abbrev.values()})
>>> to_abbrev
{'Virginia': 'VA',
'Pennsylvania': 'PA',
'Penns.': 'PA',
'VA': 'VA',
'PA': 'PA'}
Finally, call .map() to get the result:
>>> survey["State"].map(to_abbrev)
0 VA
1 VA
2 VA
3 PA
4 PA
5 PA
Name: State, dtype: object
It's worth stating the semi-obvious: your to_abbrev must be a complete mapping; otherwise, missing values will be NaN:
>>> pd.concat([survey, pd.DataFrame({"State": ["Wisconsin"]})], ignore_index=True)["State"].map(to_abbrev)
0 VA
1 VA
2 VA
3 PA
4 PA
5 PA
6 NaN
Name: State, dtype: object
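One way to find and soften those gaps is to list the unmapped answers first, then fall back to the original value with fillna. This is only a sketch: it assumes that keeping the raw answer is preferable to NaN, and the sample rows are illustrative.

```python
import pandas as pd

# Illustrative sample data; "Wisconsin" is deliberately missing from the mapping.
survey = pd.DataFrame({"State": ["Virginia", "VA", "Penns.", "Wisconsin"]})
to_abbrev = {"Virginia": "VA", "Pennsylvania": "PA", "Penns.": "PA",
             "VA": "VA", "PA": "PA"}

mapped = survey["State"].map(to_abbrev)

# Raw answers the mapping does not cover yet, so they can be added to to_abbrev.
unmapped = survey.loc[mapped.isna(), "State"].unique().tolist()
print(unmapped)

# Keep the original answer wherever the lookup produced NaN.
survey["New State"] = mapped.fillna(survey["State"])
print(survey)
```

Reviewing the unmapped list before filling is usually the safer order, since fillna hides exactly the rows that still need attention.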
As suggested in the comments, there are undoubtedly libraries out there designed to build this mapping for you more holistically, taking into account things such as common typos and small formatting differences, such as "D.C." versus "DC".
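Short of adopting such a library, small differences like these can be reduced by normalizing both the dictionary keys and the survey answers before the lookup. The sketch below is one hand-rolled possibility, not a library API: normalize and to_abbreviation are hypothetical helper names, and the rules (trim whitespace, drop periods, ignore case) are assumptions about what should count as the same answer.

```python
def normalize(text):
    """Trim whitespace, drop periods, and ignore case (assumed rules)."""
    return text.strip().replace(".", "").casefold()

names = {"Virginia": "VA", "Pennsylvania": "PA", "District of Columbia": "DC"}

# Index both the full names and the abbreviations under their normalized form.
lookup = {normalize(name): abbr for name, abbr in names.items()}
lookup.update({normalize(abbr): abbr for abbr in names.values()})

def to_abbreviation(answer):
    """Return the canonical abbreviation, or None if the answer is unknown."""
    return lookup.get(normalize(answer))

print(to_abbreviation(" virginia "))  # VA
print(to_abbreviation("D.C."))        # DC
print(to_abbreviation("Wisconsin"))   # None
```

With a DataFrame, the same function could be passed to survey["State"].map(to_abbreviation). It will not catch real typos such as "Virgina"; those still need explicit entries or a fuzzy-matching step.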
Upvotes: 1