Reputation: 210
I have a pandas DataFrame with 7 columns. For one of these columns, I want to divide its contents into n buckets depending only on the values. So, if my column has values 1, 3, 5, ..., (2*n+1), I add a new column with buckets 1, 2, 3, ..., n.
Also, I'm not looking to normalize, in the sense that even if I have a hundred 3's in the column, I want them all in the same bucket. So, if I have 1, 3, 3, 3, 5, ..., (2*n+1), my output would be 1, 2, 2, 2, 3, ..., n.
Can someone please guide me on how to do this?
--edit--
My actual data has more than a million rows, so if I use rank I get ranks from 1 to a million. What I want is to divide the ranks into buckets. For example, suppose I have 3 million rows and end up with ranks from 1 to 1.5 million. If I want to divide them into 3 buckets, bucket 1 gets the first 0.5 million ranks, bucket 2 gets the next half million, and so on. The same applies if I want to divide them into 7 buckets.
Regards
Upvotes: 1
Views: 7747
Reputation: 862581
You need rank:
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 3, 9, 5, 3, 7, 10]})
print(df)
   col
0    1
1    5
2    3
3    9
4    5
5    3
6    7
7   10
df['col1'] = df.col.rank(method='dense').astype(int)
print (df)
   col  col1
0    1     1
1    5     3
2    3     2
3    9     5
4    5     3
5    3     2
6    7     4
7   10     6
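To see why method='dense' is the variant that matches the question, here is a small sketch comparing it with two other rank methods on the sample values from the question. The variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 3, 5])

# dense: ties share a rank and the next rank follows without a gap
dense = s.rank(method="dense").astype(int)
# min: ties share the lowest rank, but later ranks skip ahead
minimum = s.rank(method="min").astype(int)
# first: ties are broken by order of appearance, so equal values get split
first = s.rank(method="first").astype(int)

print(dense.tolist())    # [1, 2, 2, 2, 3]
print(minimum.tolist())  # [1, 2, 2, 2, 5]
print(first.tolist())    # [1, 2, 3, 4, 5]
```

Only the dense ranks reproduce the 1, 2, 2, 2, 3 output asked for in the question.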
EDIT: I think you need floor division (//):
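import numpy as np

# sample data reconstructed from the output below
df = pd.DataFrame({'col': [1, 7, 3, 3, 5, 7, 13]})
# note: n here is the number of rows per bucket, not the number of buckets
n = 3
df['col1'] = np.arange(len(df.index)) // n
print(df)
   col  col1
0    1     0
1    7     0
2    3     0
3    3     1
4    5     1
5    7     1
6   13     2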
If the index is monotonically increasing, like 0, 1, 2, ..., n:
n = 3
df['col1'] = df.index // n
print (df)
   col  col1
0    1     0
1    7     0
2    3     0
3    3     1
4    5     1
5    7     1
6   13     2
Upvotes: 0
Reputation: 57033
Pandas has the function cut() for this sort of binning:
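import pandas as pd

data = pd.Series([1, 3, 3, 3, 5, 7, 13])
# one bin per step of 2 across the value range, so n_buckets is 7 here
n_buckets = (data.max() - data.min()) // 2 + 1
# labels=False returns 0-based bin numbers; add 1 to start at 1
buckets = pd.cut(data, n_buckets, labels=False) + 1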
#0 1
#1 2
#2 2
#3 2
#4 3
#5 4
#6 7
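Note that cut() on the raw values makes bins of equal width in value, so gaps in the data (like 7 to 13 above) leave bucket numbers unused. If the goal is instead a fixed number of buckets that each cover about the same number of distinct ranks, as the edited question describes, one option is to apply cut() to the dense ranks. This is a sketch under that reading of the question, with n_buckets = 3 chosen just for illustration.

```python
import pandas as pd

data = pd.Series([1, 3, 3, 3, 5, 7, 13])
n_buckets = 3  # illustrative choice

# Equal-width bins over the raw values: edges at 1, 5, 9, 13
by_value = pd.cut(data, bins=n_buckets, labels=False) + 1

# Equal-width bins over the dense ranks 1..5 instead
ranks = data.rank(method="dense")
by_rank = pd.cut(ranks, bins=n_buckets, labels=False) + 1

print(by_value.tolist())  # [1, 1, 1, 1, 1, 2, 3]
print(by_rank.tolist())   # [1, 1, 1, 1, 2, 3, 3]
```

In both cases equal values always share a bucket, because the binning depends only on the value (or its rank), never on the row position.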
Upvotes: 4