Reputation: 69
I have a dataframe with a column named 'States'
that lists various U.S. states. I need to create another column with a region specifier like 'Atlantic Coast'
I have lists of the states that belong to various regions so if the state in df['States']
matches a state in the list 'Atlantic_states'
the specifier 'Atlantic Coast'
is inserted into the new column df['region specifier']
my code below shows the list I want to compare my dataframe values with and the output of the df['States']
column.
#list of states
Atlantic_states = ['Virginia',
'Massachusetts',
'Maine',
'New York',
'Rhode Island',
'Connecticut',
'New Hampshire',
'Maryland',
'Delaware',
'New Jersey',
'North Carolina',
'South Carolina',
'Georgia',
'Florida']
print(df['States'])
Out:
States
0 Virginia
1 Massachusetts
2 Maine
3 New York
4 Rhode Island
5 Connecticut
6 New Hampshire
7 Maryland
8 Delaware
9 New Jersey
10 North Carolina
11 South Carolina
12 Georgia
13 Florida
14 Wisconsin
15 Michigan
16 Ohio
17 Pennsylvania
18 Illinois
19 Indiana
20 Minnesota
21 New York
22 Washington
23 Oregon
24 California
Upvotes: 2
Views: 1800
Reputation: 4744
Whilst Andy's answer works it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin()
. Entries to this can be lists, dicts and pandas Series.
df = pd.DataFrame(['Virginia','Massachusetts','Maine','New York','Rhode Island',
'Connecticut','New Hampshire','Maryland', 'Delaware',
'New Jersey','North Carolina', 'South Carolina','Georgia','Florida',
'Wisconsin','Michigan', 'Ohio','Pennsylvania','Illinois',
'Indiana','Minnesota','New York','Washington','Oregon',
'California'],
columns=['States'])
Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York','Rhode Island',
'Connecticut', 'New Hampshire', 'Maryland', 'Delaware',
'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
'Florida']
df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
'Unknown')
df.head()
Out[1]:
States Coast
0 Virginia Atlantic Coast
1 Massachusetts Atlantic Coast
2 Maine Atlantic Coast
3 New York Atlantic Coast
4 Rhode Island Atlantic Coast
Here are some timings using for mapping the first 10 letters of the alphabet to some random int
numbers:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
columns=['numbers'])
letters = dict(zip(list(range(0, 10)), [i for i in 'abcdefghij']))
for apply
%%timeit
def is_atlantic(state):
return True if state in letters else False
df.numbers.apply(is_atlantic)
Out[]: 1 loops, best of 3: 432 ms per loop
Now for map as suggested by JohnE
%%timeit
df.numbers.map(letters)
Out[]: 10 loops, best of 3: 56.9 ms per loop
and finally for isin (also suggested by Nickil Maveli)
%%timeit
df.numbers.isin(letters)
Out[]: 10 loops, best of 3: 20.9 ms per loop
So we see that .isin()
is much quicker than .apply()
and twice as quick as .map()
.
Note: apply
and isin
just return the boolean masks and map
fills with the desired strings. Even so, when assigning to another column isin
wins out by about 2/3 of the time of map
.
Upvotes: 3
Reputation: 50560
You have a couple options. First, to directly answer the question as posed:
Create a function that returns whether a state is in the Atlantic region or not
def is_atlantic(state):
return "Atlantic" if state in Atlantic_states else "Unknown"
Now, you use .apply()
and get the results (and return it to your new column)
df['Region'] = df['State'].apply(is_atlantic)
This returns a data frame that looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Unknown
15 Michigan Unknown
16 Ohio Unknown
17 Pennsylvania Unknown
18 Illinois Unknown
19 Indiana Unknown
20 Minnesota Unknown
21 New York Atlantic
22 Washington Unknown
23 Oregon Unknown
24 California Unknown
The first option gets cumbersome if you have multiple lists you want to check against. Instead of having multiple lists, I recommend creating a single dictionary with the State as the key and the region as the value. With only 50 values this should be easy enough to maintain.
regions = {
'Virginia': 'Atlantic',
'Massachusetts': 'Atlantic',
'Maine': 'Atlantic',
'New York': 'Atlantic',
'Rhode Island': 'Atlantic',
'Connecticut': 'Atlantic',
'New Hampshire': 'Atlantic',
'Maryland': 'Atlantic',
'Delaware': 'Atlantic',
'New Jersey': 'Atlantic',
'North Carolina': 'Atlantic',
'South Carolina': 'Atlantic',
'Georgia': 'Atlantic',
'Florida': 'Atlantic',
'Wisconsin': 'Midwest',
'Michigan': 'Midwest',
'Ohio': 'Midwest',
'Pennsylvania': 'Midwest',
'Illinois': 'Midwest',
'Indiana': 'Midwest',
'Minnesota': 'Midwest',
'New York': 'Atlantic',
'Washington': 'West',
'Oregon': 'West',
'California': 'West'
}
You can use .apply()
again, with a slightly modified function:
def get_region(state):
return regions[state]
df['Region'] = df['State'].apply(get_region)
This time your dataframe looks like this:
State Region
0 Virginia Atlantic
1 Massachusetts Atlantic
2 Maine Atlantic
3 New York Atlantic
4 Rhode Island Atlantic
5 Connecticut Atlantic
6 New Hampshire Atlantic
7 Maryland Atlantic
8 Delaware Atlantic
9 New Jersey Atlantic
10 North Carolina Atlantic
11 South Carolina Atlantic
12 Georgia Atlantic
13 Florida Atlantic
14 Wisconsin Midwest
15 Michigan Midwest
16 Ohio Midwest
17 Pennsylvania Midwest
18 Illinois Midwest
19 Indiana Midwest
20 Minnesota Midwest
21 New York Atlantic
22 Washington West
23 Oregon West
24 California West
Upvotes: 2