Reputation: 95
I've got a dataframe where one column is U.S. states. I'd like to create a new column and bin the states according to region, i.e., South, Southwest, etc. It looks like pd.cut is only used for continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?
Upvotes: 2
Views: 5532
Reputation: 111
import pandas as pd

def label_states(row):
    # Return the region label for the state in this row.
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    # Fallback for any state not listed above.
    return 'etc'

df = pd.DataFrame([{'state':"Illinois", 'data':"aaa"}, {'state':"Rhode Island",'data':"aba"}, {'state':"Georgia",'data':"aba"}, {'state':"Iowa",'data':"aba"}, {'state':"Connecticut",'data':"bbb"}, {'state':"Ohio",'data':"bbb"}])
# Apply the function row by row (axis=1) to build the new column.
df['label'] = df.apply(label_states, axis=1)
df
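A possible alternative to the row-wise apply, sketched here rather than taken from the answer itself: keep the classification in a plain dict and look it up with Series.map. The dict name STATE_TO_REGION and the extra 'Nevada' row are made up for illustration, and the mapping is deliberately incomplete.

```python
import pandas as pd

# Hypothetical, deliberately incomplete mapping; extend it to cover every state.
STATE_TO_REGION = {
    'Illinois': 'midwest',
    'Iowa': 'midwest',
    'Ohio': 'midwest',
    'Rhode Island': 'north-east',
    'Connecticut': 'north-east',
    'Georgia': 'south',
}

df = pd.DataFrame({'state': ['Illinois', 'Rhode Island', 'Georgia', 'Iowa',
                             'Connecticut', 'Ohio', 'Nevada']})

# Series.map looks each state up in the dict; states missing from the dict
# become NaN, so fillna supplies the same 'etc' fallback as label_states.
df['label'] = df['state'].map(STATE_TO_REGION).fillna('etc')
print(df)
```

Because map works on the whole column at once, this tends to be faster than apply with axis=1 on large frames, and the mapping stays in one place where it is easy to edit.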
Upvotes: 6
Reputation: 31011
Assume that your df contains:
Of course it can contain more columns and more than one row for each state.
To add region names (a new column), define a regions DataFrame, containing columns:
Then merge the two DataFrames and assign the result back to df:
df = df.merge(regions, on='State')
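As a minimal, self-contained sketch of that merge: the column layout of both frames below is an assumption inferred from the result shown in this answer (a full state name, a two-letter code in State, and a Region), and only three states are included.

```python
import pandas as pd

# Assumed layout of df: full state name plus two-letter code.
df = pd.DataFrame({
    'State Name': ['Alabama', 'Arizona', 'Arkansas'],
    'State': ['AL', 'AZ', 'AR'],
})

# Assumed layout of regions: one row per state code with its region.
regions = pd.DataFrame({
    'State': ['AL', 'AZ', 'AR'],
    'Region': ['Southeast', 'Southwest', 'South'],
})

# The default inner join keeps only states present in both frames.
df = df.merge(regions, on='State')
print(df)
```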
A part of the result is:
    State Name   State  Region
0   Alabama      AL     Southeast
1   Arizona      AZ     Southwest
2   Arkansas     AR     South
3   California   CA     West
4   Colorado     CO     Southwest
5   Connecticut  CT     Northeast
6   Delaware     DE     Northeast
7   Florida      FL     Southeast
8   Georgia      GA     Southeast
9   Idaho        ID     Northwest
10  Illinois     IL     Central
11  Indiana      IN     Central
12  Iowa         IA     East North Central
13  Kansas       KS     South
14  Kentucky     KY     Central
15  Louisiana    LA     South
Of course, there are numerous ways to assign US states to regions, so if you want to use another variant, define the regions DataFrame according to your classification.
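One thing to watch with a custom classification: the default inner merge silently drops any row whose state is missing from regions. A hedged sketch of one way to catch that, using a left merge and checking for empty regions; the sample data, the data column, and the names merged and unmatched are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['AL', 'AL', 'AZ', 'HI'],
    'data': [1, 2, 3, 4],
})

# Hypothetical partial classification: 'HI' is left out on purpose.
regions = pd.DataFrame({
    'State': ['AL', 'AZ'],
    'Region': ['Southeast', 'Southwest'],
})

# how='left' keeps every row of df; validate='many_to_one' raises an error
# if a state appears more than once in regions.
merged = df.merge(regions, on='State', how='left', validate='many_to_one')

# States that received no region show up as NaN in the Region column.
unmatched = merged.loc[merged['Region'].isna(), 'State'].unique().tolist()
print(unmatched)
```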
Upvotes: 1