Lynn Leifker

Reputation: 95

Is there a way to bin categorical data in pandas?

I've got a dataframe where one column holds U.S. states. I'd like to create a new column that bins the states by region, e.g., South, Southwest, etc. It looks like pd.cut only works on continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?

Upvotes: 2

Views: 5532

Answers (2)

Lina Alice Anderson

Reputation: 111

import pandas as pd

def label_states(row):
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    return 'etc'

df = pd.DataFrame([{'state':"Illinois", 'data':"aaa"}, {'state':"Rhode Island",'data':"aba"}, {'state':"Georgia",'data':"aba"}, {'state':"Iowa",'data':"aba"}, {'state':"Connecticut",'data':"bbb"}, {'state':"Ohio",'data':"bbb"}])

df['label'] = df.apply(label_states, axis=1)

df

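The output is the original frame plus a label column: midwest for Illinois, Iowa and Ohio, north-east for Rhode Island and Connecticut, and south for Georgia.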
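As an alternative sketch (not part of the code above), the same lookup can be written with a dictionary and Series.map, which avoids calling a Python function once per row. The STATE_TO_REGION name and the extra Hawaii row are assumptions added for illustration, and the dictionary is deliberately partial; it would need the full state lists from label_states to be useful.

```python
import pandas as pd

# Hypothetical, deliberately partial lookup: state name -> region.
# Fill it in with the full lists used in label_states.
STATE_TO_REGION = {
    "Illinois": "midwest",
    "Iowa": "midwest",
    "Ohio": "midwest",
    "Rhode Island": "north-east",
    "Connecticut": "north-east",
    "Georgia": "south",
}

# Same sample rows as above, plus one state that is absent from the
# lookup (Hawaii, added here only to show the fallback).
df = pd.DataFrame({
    "state": ["Illinois", "Rhode Island", "Georgia", "Iowa",
              "Connecticut", "Ohio", "Hawaii"],
    "data": ["aaa", "aba", "aba", "aba", "bbb", "bbb", "ccc"],
})

# Series.map looks each state up in the dict; states that are not in
# the dict come back as NaN, which fillna turns into the 'etc' fallback.
df["label"] = df["state"].map(STATE_TO_REGION).fillna("etc")

print(df)
```

Keeping the mapping in a dict also makes it easy to load the classification from a file instead of hard-coding it.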

Upvotes: 6

Valdi_Bo

Reputation: 31011

Assume that your df contains:

  • State - US state code.
  • Other columns; for the test (see below) I included only State Name.

Of course it can contain more columns and more than one row for each state.

To add region names (a new column), define a regions DataFrame containing these columns:

  • State - US state code.
  • Region - Region name.

Then merge these DataFrames and save the result back under df:

df = df.merge(regions, on='State')

A part of the result is:

        State Name State              Region
0          Alabama    AL           Southeast
1          Arizona    AZ           Southwest
2         Arkansas    AR               South
3       California    CA                West
4         Colorado    CO           Southwest
5      Connecticut    CT           Northeast
6         Delaware    DE           Northeast
7          Florida    FL           Southeast
8          Georgia    GA           Southeast
9            Idaho    ID           Northwest
10        Illinois    IL             Central
11         Indiana    IN             Central
12            Iowa    IA  East North Central
13          Kansas    KS               South
14        Kentucky    KY             Central
15       Louisiana    LA               South

Of course, there are numerous ways to assign US states to regions, so if you want to use another variant, define the regions DataFrame according to your own classification.
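A minimal sketch of the steps above, assuming a small sample: the four state codes and region names are copied from the result shown earlier, while the Hawaii row is an assumption added only to show what happens to a state that is missing from regions. With the default inner merge such a row is dropped; passing how='left' keeps it with a missing Region instead.

```python
import pandas as pd

# Sample frame shaped like the one described above (a few rows only).
# The Hawaii row is an illustrative addition with no entry in regions.
df = pd.DataFrame({
    "State Name": ["Alabama", "Arizona", "Arkansas", "California", "Hawaii"],
    "State": ["AL", "AZ", "AR", "CA", "HI"],
})

# Lookup table: one row per state code, with the region it belongs to.
regions = pd.DataFrame({
    "State": ["AL", "AZ", "AR", "CA"],
    "Region": ["Southeast", "Southwest", "South", "West"],
})

# Default merge is an inner join: rows whose State has no match in
# regions are dropped from the result.
inner = df.merge(regions, on="State")

# A left join keeps every row of df; unmatched states get NaN in Region.
left = df.merge(regions, on="State", how="left")

print(inner)
print(left)
```

Whether to use the inner or the left variant depends on whether unmatched states should disappear or stay visible for later cleanup.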

Upvotes: 1
