spiral01
spiral01

Reputation: 545

pandas: bin data into specific number of bins of specific size

I would like to bin a dataframe by the values in a single column into bins of a specific size and number.

Here is an example df:

df= pd.DataFrame(np.random.randint(0,10000,size=(10000, 4)), columns=list('ABCD'))

Say I want to bin by column D, I will first sort the data:

df.sort('D')

I would now wish to bin so that the first if bin size is 50 and bin number is 100, the first 50 values will go into bin 1, the next into bin 2, and so on and so forth. Any remaining values after the twenty bins should all go into the final bin. Is there anyway of doing this?

EDIT:

Here is a sample input:

x = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))

And here is the expected output:

    A   B   C   D   bin
0   6   8   6   5   3
1   5   4   9   1   1
2   5   1   7   4   3
3   6   3   3   3   2
4   2   5   9   3   2
5   2   5   1   3   2
6   0   1   1   0   1
7   3   9   5   8   3
8   2   4   0   1   1
9   6   4   5   6   3

As an extra aside, is it also possible to bin any equal values in the same bin? So for example, say I have bin 1 which contains values, 0,1,1 and then bin 2 contains 1,1,2. Is there any way of putting those two 1 values in bin 2 into bin 1? This will create very uneven bin sizes but this is not an issue.

Upvotes: 1

Views: 1262

Answers (1)

jezrael
jezrael

Reputation: 863801

It seems you need floor divide np.arange and then assign to new column:

idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // 3 + 1, index = idx)
print (df)
   A  B  C  D  bin  b
0  6  8  6  5    3  3
1  5  4  9  1    1  1
2  5  1  7  4    3  3
3  6  3  3  3    2  2
4  2  5  9  3    2  2
5  2  5  1  3    2  2
6  0  1  1  0    1  1
7  3  9  5  8    3  4
8  2  4  0  1    1  1
9  6  4  5  6    3  3

Detail:

print (np.arange(len(df)) // 3 + 1)
[1 1 1 2 2 2 3 3 3 4]

EDIT:

I create another question about problem with last values here:

N = 3
idx = df['D'].sort_values().index

#one possible solution, thanks divakar
def replace_irregular_groupings(a, N):
    n = len(a)
    m = N*(n//N)
    if m!=n:
        a[m:] = a[m-1]
    return a

idx = df['D'].sort_values().index
arr = replace_irregular_groupings(np.arange(len(df)) // N + 1, N)
df['b'] = pd.Series(arr, index = idx)
print (df)

   A  B  C  D  bin  b
0  6  8  6  5    3  3
1  5  4  9  1    1  1
2  5  1  7  4    3  3
3  6  3  3  3    2  2
4  2  5  9  3    2  2
5  2  5  1  3    2  2
6  0  1  1  0    1  1
7  3  9  5  8    3  3
8  2  4  0  1    1  1
9  6  4  5  6    3  3

Upvotes: 1

Related Questions