Reputation: 43494
I have the following pandas DataFrame:
import numpy as np
import pandas as pd
np.random.seed(0)
test_df = pd.DataFrame({"category": ["A", "B"]*5, "value": np.random.uniform(size=10)})
print(test_df)
# category value
#0 A 0.548814
#1 B 0.715189
#2 A 0.602763
#3 B 0.544883
#4 A 0.423655
#5 B 0.645894
#6 A 0.437587
#7 B 0.891773
#8 A 0.963663
#9 B 0.383442
I want to bin the value column using pandas.cut, but the bins parameter needs to vary based on the category column.
Specifically, I want to use the following dictionary to define which bins to use for cut:
bins = {
    "A": [0.00, 0.25, 0.50, 0.75, 1],
    # 0, 1, 2, 3, 4 <-- corresponding bin value
    "B": [0.00, 0.33, 0.66, 1]
    # 0, 1, 2, 3 <-- corresponding bin value
}
I came up with the following solution, which is to first cut the value column once for each set of bins:
cuts = {
    c: pd.cut(test_df["value"], bins=bins[c], labels=range(1, len(bins[c]))) for c in bins
}
Then I use numpy.select to assign the appropriate bin back to test_df:
test_df["bin"] = np.select(*zip(*[(test_df["category"] == c, cuts[c]) for c in bins]))
print(test_df)
# category value bin
#0 A 0.548814 3
#1 B 0.715189 3
#2 A 0.602763 3
#3 B 0.544883 2
#4 A 0.423655 2
#5 B 0.645894 2
#6 A 0.437587 2
#7 B 0.891773 3
#8 A 0.963663 4
#9 B 0.383442 2
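For readers unfamiliar with the *zip(*...) idiom: it only transposes the list of (condition, choice) pairs into the two sequences that np.select expects. A minimal equivalent sketch with both lists spelled out (condlist and choicelist are names chosen here to mirror np.select's parameters, not part of the original code):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
test_df = pd.DataFrame({"category": ["A", "B"] * 5, "value": np.random.uniform(size=10)})
bins = {"A": [0.00, 0.25, 0.50, 0.75, 1], "B": [0.00, 0.33, 0.66, 1]}

# One full-length cut of the value column per category's bin edges
cuts = {
    c: pd.cut(test_df["value"], bins=bins[c], labels=range(1, len(bins[c]))) for c in bins
}

# Row masks: which rows belong to each category
condlist = [test_df["category"] == c for c in bins]
# Candidate labels: the cut computed with that category's edges
choicelist = [cuts[c] for c in bins]

# For each row, take the label from the first choice whose mask is True
test_df["bin"] = np.select(condlist, choicelist)
print(test_df)
```

This makes the cost visible: every category adds one more full-length pd.cut over all rows, even though only that category's rows use the result.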
This gives the correct answer, but is there a more efficient way? Ideally there should be a way that doesn't involve calling cut on the whole column once for each set of bins. In my real-world data I have many more than 2 sets of bins.
Upvotes: 3
Views: 653
Reputation: 2417
Another way to go about the problem is to use groupby:
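def applied(x):
    # Look up the bin edges for this group's category
    _bins = bins[x.category.iat[0]]
    return pd.cut(x.value, bins=_bins, labels=range(1, len(_bins)))

# Drop the group-key index level so the result aligns with test_df's index
test_df['bin'] = test_df.groupby('category').apply(applied).reset_index(level=0, drop=True)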
However, it's actually quite slow compared to @Scott Boston's answer.
Upvotes: 0
Reputation: 153460
Maybe use numpy with np.searchsorted:
test_df['bin'] = [np.searchsorted(bins[i], v) for i, v in test_df.values]
Output:
category value bin
0 A 0.548814 3
1 B 0.715189 3
2 A 0.602763 3
3 B 0.544883 2
4 A 0.423655 2
5 B 0.645894 2
6 A 0.437587 2
7 B 0.891773 3
8 A 0.963663 4
9 B 0.383442 2
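If the per-row Python loop becomes the bottleneck with many rows, one possible variation (a sketch, not benchmarked here) is to call np.searchsorted once per category on all of that category's values. It relies on the default side="left" matching the right-closed intervals that pd.cut uses by default. The names result, cat and idx are illustrative only:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
test_df = pd.DataFrame({"category": ["A", "B"] * 5, "value": np.random.uniform(size=10)})
bins = {"A": [0.00, 0.25, 0.50, 0.75, 1], "B": [0.00, 0.33, 0.66, 1]}

values = test_df["value"].to_numpy()
result = np.empty(len(test_df), dtype=np.intp)

# GroupBy.indices maps each category to the integer positions of its rows
for cat, idx in test_df.groupby("category").indices.items():
    # One vectorized searchsorted call per category instead of one per row
    result[idx] = np.searchsorted(bins[cat], values[idx], side="left")

test_df["bin"] = result
print(test_df)
```

Unlike pd.cut, this does not produce NaN for out-of-range values: a value equal to the lowest edge gets 0, and a value above the highest edge gets len(edges). It also selects the two columns by name, so it keeps working after a bin column has been added to test_df.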
Timings
%timeit np.select(*zip(*[(test_df["category"] == c, cuts[c]) for c in bins]))
1.21 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit [np.searchsorted(bins[i], v) for i, v in test_df.values]
301 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
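The %timeit lines above need IPython. A rough way to repeat the comparison in a plain script, assuming the same test_df, bins and cuts as in the question (with_select, with_searchsorted and n are names made up for this sketch; absolute numbers will differ by machine and library version):

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
test_df = pd.DataFrame({"category": ["A", "B"] * 5, "value": np.random.uniform(size=10)})
bins = {"A": [0.00, 0.25, 0.50, 0.75, 1], "B": [0.00, 0.33, 0.66, 1]}
cuts = {
    c: pd.cut(test_df["value"], bins=bins[c], labels=range(1, len(bins[c]))) for c in bins
}

def with_select():
    # Same statement as the first timing; the pd.cut calls are not included
    return np.select(*zip(*[(test_df["category"] == c, cuts[c]) for c in bins]))

def with_searchsorted():
    # Per-row lookup, iterating over the two columns by name
    return [np.searchsorted(bins[c], v) for c, v in zip(test_df["category"], test_df["value"])]

n = 200  # number of repetitions, chosen arbitrarily
t_select = timeit.timeit(with_select, number=n) / n
t_search = timeit.timeit(with_searchsorted, number=n) / n
print(f"np.select:       {t_select * 1e6:.1f} us per call")
print(f"np.searchsorted: {t_search * 1e6:.1f} us per call")
```

Note that, as in the original timing, the np.select measurement excludes the cost of building cuts, so it understates the total cost of that approach.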
Upvotes: 3