user53526356

Reputation: 968

"Bin labels must be one fewer than the number of bin edges" after passing pd.qcut duplicates='drop' kwarg

I have a df with hundreds of thousands of rows, and am creating a new dataframe which only contains the top quantile of rows for some group of values:

quantiles = (df.groupby(['Person', 'Date'])['Value'].apply(lambda x: pd.qcut(x, 4, labels=[0, 0.25, 0.5, 1], duplicates='drop')))

When I run it, I get:

ValueError: Bin labels must be one fewer than the number of bin edges

After trying to change the number of bins to 5 I still get the same error.

How can I fix this?

Upvotes: 12

Views: 34533

Answers (3)

lzy

Reputation: 1

Following up on Amit Bidlan's answer: if you want integer labels instead of the original interval strings (for plotting, for example), I would do this:

train['range_ori'] = pd.qcut(train['MasVnrArea'], q=5, duplicates='drop')
grouplist = train["range_ori"].unique().tolist()
grouplist.sort()
group_mapping = {g:i for i, g in enumerate(grouplist)}
train['group'] = train['range_ori'].apply(lambda x: group_mapping[x])

The advantage is that this handles a variable number of bins, so you don't need to check the bin count before setting the labels.
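A shorter route to the same integer codes, as a minimal sketch on made-up data (the Series `s` is hypothetical, not the `train` frame above): `pd.qcut` accepts `labels=False`, which returns the integer bin index directly, and the categorical result also exposes `.cat.codes`.

```python
import pandas as pd

# Made-up data: many repeated zeros, so several quantile edges coincide.
s = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

# labels=False returns integer bin indices instead of interval labels,
# however many bins survive duplicates='drop'.
codes = pd.qcut(s, q=5, labels=False, duplicates='drop')

# Equivalent route: bin first, then read the integer codes off the categorical.
codes_from_cat = pd.qcut(s, q=5, duplicates='drop').cat.codes

print(codes.tolist())
```

Either way there is no label list to keep in sync with the bin count.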

Upvotes: 0

Raushan Kumar

Reputation: 191

This happens because, after duplicate bin edges are dropped in some of the groups, you are passing more labels than there are bins left. You have to generate the labels dynamically, so that each group gets only as many labels as it actually has bins.
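To see the cause in isolation, here is a minimal sketch on made-up data (the Series `s` is hypothetical). It uses `retbins=True` to inspect how many bin edges survive `duplicates='drop'`, then reproduces the error by passing four labels anyway.

```python
import pandas as pd

# Made-up data: more than half the values are 0, so the 0%, 25% and 50%
# quantile edges all coincide.
s = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

# retbins=True also returns the edges that remain after duplicates are dropped.
_, edges = pd.qcut(s, 4, duplicates='drop', retbins=True)
n_bins = len(edges) - 1
print(n_bins)  # 2 for this data, not the 4 bins requested

# Four labels for two bins triggers the error from the question.
message = None
try:
    pd.qcut(s, 4, labels=[0, 0.25, 0.5, 1], duplicates='drop')
except ValueError as exc:
    message = str(exc)
print(message)
```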

Method 1

df.groupby(key)['sales'].transform(lambda x: pd.qcut(x, min(len(x),3), labels=range(1,min(len(x),3)+1), duplicates = 'drop'))

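Here I make at most 3 cuts, falling back to 1 or 2 when a group has fewer than 3 rows. Note that this can still raise the same error when a group has enough rows but many repeated values; Method 2 avoids that by ranking first.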

Method 2

import numpy as np
import pandas as pd

# Rank within each group; method='first' breaks ties, so ranks are unique per group
df['rank'] = df.groupby([key])['sales'].transform(lambda x: x.rank(method='first'))

def q_cut(x, cuts):
    # Never request more bins than there are distinct values
    unique_var = len(np.unique(x))
    labels1 = range(1, min(unique_var, cuts) + 1)
    output = pd.qcut(x, min(unique_var, cuts), labels=labels1, duplicates='drop')
    return output

df.groupby([key])['rank'].transform(lambda x: q_cut(x, 10))
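As a self-contained illustration of the rank-then-cut idea, here is a small sketch on made-up data. The `store` column and the `bucket` name are hypothetical, and this variant uses `labels=False` plus 1 to get integer buckets rather than passing a label range, so it is a simplification of the code above, not a copy of it.

```python
import pandas as pd

# Made-up data: group 'a' has heavily repeated values, group 'b' has only two rows.
df = pd.DataFrame({
    'store': ['a'] * 6 + ['b'] * 2,
    'sales': [5, 5, 5, 5, 7, 9, 3, 3],
})

# method='first' breaks ties by order of appearance, so ranks are unique per group.
df['rank'] = df.groupby('store')['sales'].rank(method='first')

def q_cut(x, cuts):
    # Cap the number of bins at the number of distinct ranks in the group.
    n = min(x.nunique(), cuts)
    # labels=False yields 0-based bin indices; shift them to start at 1.
    return pd.qcut(x, n, labels=False) + 1

df['bucket'] = df.groupby('store')['rank'].transform(lambda x: q_cut(x, 3))
print(df)
```

Because the ranks are unique within each group, no quantile edges coincide and `duplicates='drop'` is not needed here.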

Upvotes: 1

Amit Bidlan

Reputation: 81

I was facing the same issue and I did this to overcome it.

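bins = the number of intervals the data is sliced into

labels = the names assigned to those intervals

This error appears when the number of labels does not match the number of bins; here, there are more labels than bins left after duplicate edges are dropped.

Follow these steps:

Step 1: Don't pass labels initially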

train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
                          q=5,duplicates='drop')

Checking the result with value_counts() gives:

(-0.001, 16.0]     880
(205.2, 1600.0]    292
(16.0, 205.2]      288
Name: MasVnrArea, dtype: int64

Step 2:

Now we can see that only three bins are actually produced, so pass exactly that many labels. In my case that is 3, so I am passing 3 labels.

bin_labels_MasVnrArea = ['Platinum_MasVnrArea', 
                         'Diamond_MasVnrArea','Supreme_MasVnrArea']
train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
                              q=5,labels=bin_labels_MasVnrArea,duplicates='drop')
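The two steps can also be combined so the bin count does not have to be read off by eye. The following is only a sketch: `qcut_with_labels` is a hypothetical helper, the data is made up, and truncating the label list to the surviving bin count is an assumption about what you want when bins are dropped.

```python
import pandas as pd

def qcut_with_labels(values, q, labels):
    """Hypothetical helper: count the surviving bins, then pass that many labels."""
    # First pass: no labels, just find the edges left after duplicates are dropped.
    _, edges = pd.qcut(values, q=q, duplicates='drop', retbins=True)
    n_bins = len(edges) - 1
    # Second pass: keep only the first n_bins labels (an assumed policy).
    return pd.qcut(values, q=q, labels=labels[:n_bins], duplicates='drop')

# Made-up data in which q=5 collapses to 3 bins.
s = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])
names = ['Platinum', 'Diamond', 'Supreme', 'Extra1', 'Extra2']

out = qcut_with_labels(s, 5, names)
print(out.value_counts())
```

Inside a groupby, the same helper could be called per group, since each group may end up with a different number of bins.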

Please watch this video on bins for a clear understanding.

https://www.youtube.com/watch?v=HofOMf8RgjM

Upvotes: 7
