Dividing pandas dataframe column into n buckets

Question

I have a pandas dataframe with 7 columns. For one of these columns, I want to divide its content into n buckets depending only on the values. So, if my column has the values 1, 3, 5, ..., (2*n-1), I want to add a new column with the buckets 1, 2, 3, ..., n.

Also, I'm not looking to normalize: even if I have a hundred 3's in the column, I want them all in the same bucket. So, if I have 1, 3, 3, 3, 5, ..., (2*n-1), my output would be 1, 2, 2, 2, 3, ..., n.

Can someone please guide me on how to do this?

--edit--

My actual data has more than a million rows, so if I use rank I get ranks from 1 to a million. What I want is to divide those ranks into buckets. For example, suppose I have 3 million rows and end up with ranks from 1 to 1.5 million. If I want to divide them into 3 buckets, the first bucket should get the first 0.5 million ranks, the second bucket the next half million, and so on. The same should work if I want to divide them into 7 buckets.

Regards

jezrael · Accepted Answer

You need rank with method='dense', so equal values share a rank and the ranks have no gaps:

import pandas as pd

df = pd.DataFrame({'col':[1,5,3,9,5,3,7,10]})
print (df)
   col
0    1
1    5
2    3
3    9
4    5
5    3
6    7
7   10

df['col1'] = df.col.rank(method='dense').astype(int)
print (df)
   col  col1
0    1     1
1    5     3
2    3     2
3    9     5
4    5     3
5    3     2
6    7     4
7   10     6
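
To get a fixed number of buckets out of these dense ranks, as the question's edit asks, one option is to scale the ranks down with integer arithmetic. This is only a sketch: the name n_buckets and the scaling formula are my own assumptions, not part of the original answer. It splits the range of ranks evenly, so equal values always land in the same bucket, but the buckets need not contain the same number of rows.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 3, 9, 5, 3, 7, 10]})
n_buckets = 3  # hypothetical number of buckets

# Dense ranks run from 1 to k, where k is the number of distinct values.
dense = df['col'].rank(method='dense').astype(int)
k = dense.max()

# Map rank r in 1..k onto a bucket in 1..n_buckets by splitting
# the rank range into n_buckets equal-width slices.
df['bucket'] = (dense - 1) * n_buckets // k + 1
print(df)
```

With 6 distinct values and 3 buckets, ranks 1-2 go to bucket 1, ranks 3-4 to bucket 2 and ranks 5-6 to bucket 3.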

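EDIT: I think you need floor division (//) on the row positions. Note that n here is the number of rows per bucket, not the number of buckets:

import numpy as np

df = pd.DataFrame({'col':[1,7,3,3,5,7,13]})
n = 3
# row positions 0,1,2 -> bucket 0; 3,4,5 -> bucket 1; ...
df['col1'] = np.arange(len(df.index)) // n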
print (df)
   col  col1
0    1     0
1    7     0
2    3     0
3    3     1
4    5     1
5    7     1
6   13     2

If the index is the default monotonically increasing one (0, 1, 2, ...), you can divide the index directly:

n = 3
df['col1'] = df.index // n
print (df)
   col  col1
0    1     0
1    7     0
2    3     0
3    3     1
4    5     1
5    7     1
6   13     2
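
If you want a fixed number of buckets rather than a fixed number of rows per bucket, the same floor-division idea can be scaled by the row count. This is a rough sketch, not part of the original answer; n_buckets is a name I made up for illustration. Because it works on row positions, it can split equal values across two buckets, so it only fits when row order is what matters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 7, 3, 3, 5, 7, 13]})
n_buckets = 3  # hypothetical number of buckets

# Scale each row position by n_buckets / len(df) using integer math,
# which yields bucket labels 0..n_buckets-1 of near-equal size.
positions = np.arange(len(df))
df['col1'] = positions * n_buckets // len(df)
print(df)
```

With 7 rows and 3 buckets the sizes come out as 3, 2 and 2, since 7 does not divide evenly.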
