ProficientInMath
ProficientInMath

Reputation: 69

New column in pandas dataframe based on existing column values

I have a dataframe with a column named 'States' that lists various U.S. states. I need to create another column with a region specifier like 'Atlantic Coast' I have lists of the states that belong to various regions so if the state in df['States'] matches a state in the list 'Atlantic_states' the specifier 'Atlantic Coast' is inserted into the new column df['region specifier'] my code below shows the list I want to compare my dataframe values with and the output of the df['States'] column.

 #list of states
 Atlantic_states = ['Virginia',
              'Massachusetts',
              'Maine',
              'New York',
              'Rhode Island',
              'Connecticut',
              'New Hampshire',
              'Maryland',
              'Delaware',
              'New Jersey',
              'North Carolina',
              'South Carolina',
              'Georgia',
              'Florida']
 print(df['States'])

 Out:
                 States
       0         Virginia
       1    Massachusetts
       2            Maine
       3         New York
       4     Rhode Island
       5      Connecticut
       6    New Hampshire
       7         Maryland
       8         Delaware
       9       New Jersey
       10  North Carolina
       11  South Carolina
       12         Georgia
       13         Florida
       14       Wisconsin
       15        Michigan
       16            Ohio
       17    Pennsylvania
       18        Illinois
       19         Indiana
       20       Minnesota
       21        New York
       22      Washington
       23          Oregon
       24      California

Upvotes: 2

Views: 1800

Answers (2)

Little Bobby Tables
Little Bobby Tables

Reputation: 4744

Whilst Andy's answer works it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin(). Entries to this can be lists, dicts and pandas Series.

df = pd.DataFrame(['Virginia','Massachusetts','Maine','New York','Rhode Island',
                   'Connecticut','New Hampshire','Maryland', 'Delaware',
                   'New Jersey','North Carolina', 'South Carolina','Georgia','Florida',
                   'Wisconsin','Michigan', 'Ohio','Pennsylvania','Illinois',
                   'Indiana','Minnesota','New York','Washington','Oregon',
                   'California'],
                  columns=['States'])

Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York','Rhode Island',
                   'Connecticut', 'New Hampshire',  'Maryland', 'Delaware',
                   'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
                   'Florida']

df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
                       'Unknown')
df.head()

Out[1]: 

    States          Coast
0   Virginia        Atlantic Coast
1   Massachusetts   Atlantic Coast
2   Maine           Atlantic Coast
3   New York        Atlantic Coast
4   Rhode Island    Atlantic Coast

Benchmarks

Here are some timings using for mapping the first 10 letters of the alphabet to some random int numbers:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
                  columns=['numbers'])
letters = dict(zip(list(range(0, 10)), [i for i in 'abcdefghij']))

for apply

%%timeit
def is_atlantic(state):
    return True if state in letters else False

df.numbers.apply(is_atlantic)

Out[]: 1 loops, best of 3: 432 ms per loop

Now for map as suggested by JohnE

%%timeit
df.numbers.map(letters)

Out[]: 10 loops, best of 3: 56.9 ms per loop

and finally for isin (also suggested by Nickil Maveli)

%%timeit 
df.numbers.isin(letters)

Out[]: 10 loops, best of 3: 20.9 ms per loop

So we see that .isin() is much quicker than .apply() and twice as quick as .map().

Note: apply and isin just return the boolean masks and map fills with the desired strings. Even so, when assigning to another column isin wins out by about 2/3 of the time of map.

Upvotes: 3

Andy
Andy

Reputation: 50560

You have a couple options. First, to directly answer the question as posed:

Option 1

Create a function that returns whether a state is in the Atlantic region or not

def is_atlantic(state):
    return "Atlantic" if state in Atlantic_states else "Unknown"

Now, you use .apply() and get the results (and return it to your new column)

df['Region'] = df['State'].apply(is_atlantic)

This returns a data frame that looks like this:

    State           Region
0   Virginia        Atlantic
1   Massachusetts   Atlantic
2   Maine           Atlantic
3   New York        Atlantic
4   Rhode Island    Atlantic
5   Connecticut     Atlantic
6   New Hampshire   Atlantic
7   Maryland        Atlantic
8   Delaware        Atlantic
9   New Jersey      Atlantic
10  North Carolina  Atlantic
11  South Carolina  Atlantic
12  Georgia         Atlantic
13  Florida         Atlantic
14  Wisconsin       Unknown
15  Michigan        Unknown
16  Ohio            Unknown
17  Pennsylvania    Unknown
18  Illinois        Unknown
19  Indiana         Unknown
20  Minnesota       Unknown
21  New York        Atlantic
22  Washington      Unknown
23  Oregon          Unknown
24  California      Unknown

Option 2

The first option gets cumbersome if you have multiple lists you want to check against. Instead of having multiple lists, I recommend creating a single dictionary with the State as the key and the region as the value. With only 50 values this should be easy enough to maintain.

    regions = {
    'Virginia': 'Atlantic',
    'Massachusetts': 'Atlantic',
    'Maine': 'Atlantic',
    'New York': 'Atlantic',
    'Rhode Island': 'Atlantic',
    'Connecticut': 'Atlantic',
    'New Hampshire': 'Atlantic',
    'Maryland': 'Atlantic',
    'Delaware': 'Atlantic',
    'New Jersey': 'Atlantic',
    'North Carolina': 'Atlantic',
    'South Carolina': 'Atlantic',
    'Georgia': 'Atlantic',
    'Florida': 'Atlantic',
    'Wisconsin': 'Midwest',
    'Michigan': 'Midwest',
    'Ohio': 'Midwest',
    'Pennsylvania': 'Midwest',
    'Illinois': 'Midwest',
    'Indiana': 'Midwest',
    'Minnesota': 'Midwest',
    'New York': 'Atlantic',
    'Washington': 'West',
    'Oregon': 'West',
    'California': 'West'
}

You can use .apply() again, with a slightly modified function:

def get_region(state):
    return regions[state]

df['Region'] = df['State'].apply(get_region)

This time your dataframe looks like this:

    State           Region
0   Virginia        Atlantic
1   Massachusetts   Atlantic
2   Maine           Atlantic
3   New York        Atlantic
4   Rhode Island    Atlantic
5   Connecticut     Atlantic
6   New Hampshire   Atlantic
7   Maryland        Atlantic
8   Delaware        Atlantic
9   New Jersey      Atlantic
10  North Carolina  Atlantic
11  South Carolina  Atlantic
12  Georgia         Atlantic
13  Florida         Atlantic
14  Wisconsin       Midwest
15  Michigan        Midwest
16  Ohio            Midwest
17  Pennsylvania    Midwest
18  Illinois        Midwest
19  Indiana         Midwest
20  Minnesota       Midwest
21  New York        Atlantic
22  Washington      West
23  Oregon          West
24  California      West

Upvotes: 2

Related Questions