pault

Reputation: 43494

Pandas efficiently cut column with bins argument based on another column

I have the following pandas DataFrame:

import numpy as np
import pandas as pd

np.random.seed(0)
test_df = pd.DataFrame({"category": ["A", "B"]*5, "value": np.random.uniform(size=10)})

print(test_df)
#  category     value
#0        A  0.548814
#1        B  0.715189
#2        A  0.602763
#3        B  0.544883
#4        A  0.423655
#5        B  0.645894
#6        A  0.437587
#7        B  0.891773
#8        A  0.963663
#9        B  0.383442

I want to bin the value column using pandas.cut, but the bins parameter needs to vary based on the category column.

Specifically, I want to use the following dictionary to define what bins to use for cut:

bins = {
    "A": [0.00, 0.25, 0.50, 0.75, 1],
    #        1     2     3     4      <-- corresponding bin label
    "B": [0.00, 0.33, 0.66, 1]
    #        1     2     3            <-- corresponding bin label
}
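For reference, this is how `pd.cut` maps values to labels for a single category's edges (a standalone sketch with made-up sample values, not part of the final solution):

```python
import pandas as pd

# Category A's edges define four right-closed intervals:
# (0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1], labeled 1..4.
edges = [0.00, 0.25, 0.50, 0.75, 1]
values = pd.Series([0.1, 0.3, 0.6, 0.9])
labels = pd.cut(values, bins=edges, labels=range(1, len(edges)))
print(labels.tolist())  # [1, 2, 3, 4]
```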

I came up with the following solution, which is to first cut the value column using each of the bin lists:

cuts = {
    c: pd.cut(test_df["value"], bins=bins[c], labels=range(1, len(bins[c]))) for c in bins
}

Then I use numpy.select to assign the appropriate bin back to test_df:

test_df["bin"] = np.select(*zip(*[(test_df["category"] == c, cuts[c]) for c in bins]))
print(test_df)
#  category     value  bin
#0        A  0.548814    3
#1        B  0.715189    3
#2        A  0.602763    3
#3        B  0.544883    2
#4        A  0.423655    2
#5        B  0.645894    2
#6        A  0.437587    2
#7        B  0.891773    3
#8        A  0.963663    4
#9        B  0.383442    2
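The np.select step above can be sketched in isolation (toy condition/choice arrays, not the question's data): it picks, element-wise, the value from the first choice array whose condition is True.

```python
import numpy as np

# For each position, np.select returns the element of the first
# choice array whose corresponding condition is True.
conds = [np.array([True, False, False]), np.array([False, True, True])]
choices = [np.array([10, 11, 12]), np.array([20, 21, 22])]
print(np.select(conds, choices))  # [10 21 22]
```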

This gives the correct answer, but is there a more efficient way? Ideally there would be a way that doesn't involve calling cut once per bin list. In my real-world data I have many more than 2 categories.

Upvotes: 3

Views: 653

Answers (2)

Ayoub ZAROU

Reputation: 2417

Another way to go about the problem is to use groupby:

def applied(x):
    # Look up the bin edges for this group's category.
    _bins = bins[x.category.iat[0]]
    return pd.cut(x.value, bins=_bins, labels=range(1, len(_bins)))

test_df['bin'] = test_df.groupby('category').apply(applied).reset_index(level=0, drop=True)

but it's actually quite slow compared to @Scott Boston's answer.

Upvotes: 0

Scott Boston

Reputation: 153460

Maybe use numpy with np.searchsorted:

test_df['bin'] = [np.searchsorted(bins[i], v) for i, v in test_df.values]

Output:

  category     value  bin
0        A  0.548814    3
1        B  0.715189    3
2        A  0.602763    3
3        B  0.544883    2
4        A  0.423655    2
5        B  0.645894    2
6        A  0.437587    2
7        B  0.891773    3
8        A  0.963663    4
9        B  0.383442    2
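A quick sketch of why this works: np.searchsorted returns the index at which a value would be inserted to keep the edge array sorted, which for a value strictly inside the range coincides with the right-closed bin label that pd.cut assigns. (Edge cases differ slightly: with the default side="left", a value exactly equal to the lowest edge gets index 0, where cut would give NaN.)

```python
import numpy as np

edges = [0.00, 0.25, 0.50, 0.75, 1]
# 0.55 falls in (0.50, 0.75], so its insertion index is 3,
# matching the label pd.cut assigns with labels 1..4.
print(np.searchsorted(edges, 0.55))  # 3
```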

Timings

%timeit np.select(*zip(*[(test_df["category"] == c, cuts[c]) for c in bins]))
1.21 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%timeit [np.searchsorted(bins[i], v) for i, v in test_df.values]
301 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 3
