Reputation: 968
I have a df with hundreds of thousands of rows, and I am creating a new dataframe that contains only the top quantile of rows for each group of values:
quantiles = (df.groupby(['Person', 'Date'])['Value'].apply(lambda x: pd.qcut(x, 4, labels=[0, 0.25, 0.5, 1], duplicates='drop')))
When I run it, I get:
ValueError: Bin labels must be one fewer than the number of bin edges
After changing the number of bins to 5, I still get the same error.
How can I fix this?
Upvotes: 12
Views: 34533
Reputation: 1
Following up on Amit Bidlan's answer: if you want the labels to be integers instead of the original interval strings (for plotting, for example), I would do it like this:
train['range_ori'] = pd.qcut(train['MasVnrArea'], q=5, duplicates='drop')
grouplist = train["range_ori"].unique().tolist()
grouplist.sort()
group_mapping = {g:i for i, g in enumerate(grouplist)}
train['group'] = train['range_ori'].apply(lambda x: group_mapping[x])
The advantage is that this handles a variable number of bins, so there is no need to check the bin count before setting the labels.
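As a possible shortcut for the same goal, `pd.qcut` accepts `labels=False`, which returns the integer bin index directly. The sketch below assumes a small made-up `train` frame (the values are hypothetical, not the real dataset) and is only meant to illustrate the idea:

```python
import pandas as pd

# Hypothetical stand-in for the `train` frame used above
train = pd.DataFrame(
    {'MasVnrArea': [0, 0, 0, 0, 0, 0, 16, 120, 205, 400, 900, 1600]}
)

# labels=False returns the integer index of each bin instead of an interval,
# so no label list has to match the number of bins left after dropping duplicates
train['group'] = pd.qcut(
    train['MasVnrArea'], q=5, labels=False, duplicates='drop'
)
```

The codes start at 0 and follow the order of the bins, so they can be used for plotting without a separate mapping step.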
Upvotes: 0
Reputation: 191
This happens because, after duplicate bin edges are dropped for some of the groups, the number of labels you passed is larger than the number of bins that remain. You have to make the labels dynamic, that is, generate only as many labels as there are bins in each group.
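A minimal reproduction may make the cause easier to see. The series below is made-up data in which most values are identical, so several quantile edges coincide:

```python
import pandas as pd

# Hypothetical data: six identical values make several quantile edges equal
values = pd.Series([0, 0, 0, 0, 0, 0, 1, 2])

# With duplicates='drop', coinciding edges are merged, leaving fewer than 4 bins
_, edges = pd.qcut(values, 4, retbins=True, duplicates='drop')
n_bins = len(edges) - 1

# Passing 4 labels for fewer than 4 bins raises the error from the question
try:
    pd.qcut(values, 4, labels=[0, 0.25, 0.5, 1], duplicates='drop')
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Here `n_bins` ends up smaller than the 4 labels supplied, which is the mismatch the error message complains about.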
Method 1:
df.groupby(key)['sales'].transform(lambda x: pd.qcut(x, min(len(x),3), labels=range(1,min(len(x),3)+1), duplicates = 'drop'))
Here I make at most 3 cuts, falling back to 1 or 2 when a group has fewer rows.
Method 2:
import numpy as np
import pandas as pd

# Rank within each group first so that tied values become distinct
df['rank'] = df.groupby([key])['sales'].transform(lambda x: x.rank(method='first'))

def q_cut(x, cuts):
    # Never request more bins than there are distinct values
    unique_var = len(np.unique(x))
    n_bins = min(unique_var, cuts)
    labels1 = range(1, n_bins + 1)
    return pd.qcut(x, n_bins, labels=labels1, duplicates='drop')

df.groupby([key])['rank'].transform(lambda x: q_cut(x, 10))
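For the grouping in the question, one possible way to keep only the top bin per group is sketched below. The data is made up, and treating "the highest bin that survives in each group" as the top quantile is an assumption rather than something the question specifies:

```python
import pandas as pd

# Hypothetical data using the column names from the question
df = pd.DataFrame({
    'Person': ['a'] * 8 + ['b'] * 8,
    'Date': ['2020-01-01'] * 16,
    'Value': [0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8],
})

groups = df.groupby(['Person', 'Date'])

# Integer bin index per group; labels=False avoids any label-count mismatch
df['bin'] = groups['Value'].transform(
    lambda x: pd.qcut(x, 4, labels=False, duplicates='drop')
)

# Keep the rows that fall in the highest bin of their own group
top = df[df['bin'] == df.groupby(['Person', 'Date'])['bin'].transform('max')]
```

Groups whose values are heavily tied end up with fewer bins, so their top bin covers a larger share of the rows than a true quartile would.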
Upvotes: 1
Reputation: 81
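I was facing the same issue, and this is how I overcame it.
bins = the number of intervals the data is sliced into
labels = the names you assign to those intervals
This error appears when the number of labels does not match the number of bins that remain after duplicates are dropped, typically when there are more labels than bins.
Follow these steps:
Step 1: Don't pass labels initially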
train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
q=5,duplicates='drop')
Counting the values in each bin (for example with value_counts()) then gives:
(-0.001, 16.0] 880
(205.2, 1600.0] 292
(16.0, 205.2] 288
Name: MasVnrArea, dtype: int64
Step 2:
Now we can see that only three bins are possible, so assign labels accordingly. In my case that is 3, so I pass 3 labels.
bin_labels_MasVnrArea = ['Platinum_MasVnrArea',
'Diamond_MasVnrArea','Supreme_MasVnrArea']
train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
q=5,labels=bin_labels_MasVnrArea,duplicates='drop')
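The two steps above can also be combined so that the bin count does not have to be read off by hand. This is only a sketch: the helper name `qcut_with_names`, the sample values, and the label names are all hypothetical, and it relies on `retbins=True` returning the bin edges left after duplicates are dropped:

```python
import pandas as pd

def qcut_with_names(values, q, names):
    """Bin into up to q quantiles, using only as many names as bins remain."""
    # First pass without labels, only to learn the deduplicated bin edges
    _, edges = pd.qcut(values, q=q, retbins=True, duplicates='drop')
    n_bins = len(edges) - 1
    # Second pass with the label list trimmed to the actual number of bins
    return pd.qcut(values, q=q, labels=names[:n_bins], duplicates='drop')

# Hypothetical values with many zeros, so some quantile edges coincide
area = pd.Series([0, 0, 0, 0, 0, 0, 16, 120, 205, 400, 900, 1600])
names = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
binned = qcut_with_names(area, 5, names)
```

Trimming from the front of the list is a design choice; a different rule for picking which names to keep may suit other data better.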
Please watch this video on bins for a clear understanding.
https://www.youtube.com/watch?v=HofOMf8RgjM
Upvotes: 7