Reputation: 1819
I have a pandas.DataFrame
of the form
low_bound high_bound name
0 10 'a'
10 20 'b'
20 30 'c'
30 40 'd'
40 50 'e'
I have a very long pandas.Series
of the form:
value
5.7
30.4
21
35.1
I want to give to each value of the Series its corresponding name with respect to the low_bound/high_bound/name DataFrame. Here is my expected result:
value name
5.7 'a'
30.4 'd'
21 'c'
35.1 'd'
Indeed, 5.7 name is 'a' since 5.7 is between 0 and 10 excluded.
What would be the most efficient code? I know I can solve the problem by iterating through the Series, but maybe there is a quicker vectorial solution which is escaping me.
Note finally that my bounds can be custom and irregular. Here they are regular for the sake of the example.
Upvotes: 6
Views: 13485
Reputation: 7390
You can do this
buckets = [0, 10, 20, 30, 40]
buckets_name = ['a', 'b', 'c', 'd']
pd.cut(your_series, buckets , labels = buckets_name)
Upvotes: 3
Reputation: 10150
Pandas has a method called cut
that will do what you want:
import pandas as pd
data = [{"low": 0, "high": 10, "name": "a"},
{"low": 10, "high": 20, "name": "b"},
{"low": 20, "high": 30, "name": "c"},
{"low": 30, "high": 40, "name": "d"},
{"low": 40, "high": 50, "name": "e"},]
myDF = pd.DataFrame(data)
#data to be binned
mySeries = pd.Series([5.7, 30.4, 21, 35.1])
#create bins from original data
bins = list(myDF["high"])
bins.insert(0,0)
print pd.cut(mySeries, bins, labels = myDF["name"])
That will give you the following, which you can then put back into some dataframe or however you want to hold your data:
0 a
1 d
2 c
3 d
dtype: category
Categories (5, object): [a < b < c < d < e]
Depending on how irregular your bins are (and what you mean exactly by custom/irregular), you might have to resort to looping through the series. I can't think off the top of my head of a builtin that will handle this for you, especially given that it depends on the degree/type of irregularity in the bins.
Looping wise, this method will work if you have a lower and upper bound, regardless of "regularity":
for el in mySeries:
print myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)]
I appreciate that you might not want to loop through a huge series, but at least we're not manually indexing into the dataframe, which would probably make things even slower
Upvotes: 6