sweeeeeet
sweeeeeet

Reputation: 1819

Classify data by value in pandas

I have a pandas.DataFrame of the form

low_bound   high_bound   name
0           10           'a'
10          20           'b'
20          30           'c'
30          40           'd'
40          50           'e'

I have a very long pandas.Series of the form:

value
5.7
30.4
21
35.1

I want to give to each value of the Series its corresponding name with respect to the low_bound/high_bound/name DataFrame. Here is my expected result:

value         name
5.7           'a'
30.4          'd'
21            'c'
35.1          'd'

Indeed, 5.7 name is 'a' since 5.7 is between 0 and 10 excluded.

What would be the most efficient code? I know I can solve the problem by iterating through the Series, but maybe there is a quicker vectorial solution which is escaping me.

Note finally that my bounds can be custom and irregular. Here they are regular for the sake of the example.

Upvotes: 6

Views: 13485

Answers (2)

sparkle
sparkle

Reputation: 7390

You can do this

buckets = [0, 10, 20, 30, 40]
buckets_name = ['a', 'b', 'c', 'd']

pd.cut(your_series, buckets , labels = buckets_name)

Upvotes: 3

Simon
Simon

Reputation: 10150

Pandas has a method called cut that will do what you want:

import pandas as pd

data = [{"low": 0, "high": 10, "name": "a"},
        {"low": 10, "high": 20, "name": "b"},
        {"low": 20, "high": 30, "name": "c"},
        {"low": 30, "high": 40, "name": "d"},
        {"low": 40, "high": 50, "name": "e"},]

myDF = pd.DataFrame(data)

#data to be binned
mySeries = pd.Series([5.7, 30.4, 21, 35.1])

#create bins from original data
bins = list(myDF["high"])
bins.insert(0,0)

print pd.cut(mySeries, bins, labels = myDF["name"])

That will give you the following, which you can then put back into some dataframe or however you want to hold your data:

0    a
1    d
2    c
3    d
dtype: category
Categories (5, object): [a < b < c < d < e]

Depending on how irregular your bins are (and what you mean exactly by custom/irregular), you might have to resort to looping through the series. I can't think off the top of my head of a builtin that will handle this for you, especially given that it depends on the degree/type of irregularity in the bins.

Looping wise, this method will work if you have a lower and upper bound, regardless of "regularity":

for el in mySeries:
    print myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)]

I appreciate that you might not want to loop through a huge series, but at least we're not manually indexing into the dataframe, which would probably make things even slower

Upvotes: 6

Related Questions