Jordan

Reputation: 1495

How do I make bins of equal number of observations in a pandas dataframe?

I'm trying to add a column to a dataframe denoting the group or bin each observation belongs to. The idea is to sort the dataframe by some column, then create another column holding the bin number. If I want deciles, I should be able to tell a function I want 10 equal (or close to equal) groups.

I tried pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, etc. Take the following for example:

import numpy as np
import pandas as pd

x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))

df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)

This gives a df of 14 observations. How could I split them into 5 (nearly) equal bins?

The desired result would be the following:

        x         y  group
0       1  0.926273      1
1       2  0.678101      1
2       3  0.636875      1
3       4  0.802590      2
4       5  0.494553      2
5       6  0.874876      2
6       7  0.607902      3
7       8  0.028737      3
8       5  0.493545      3
9      45  0.498140      4
10  64545  0.938377      4
11     65  0.613015      4
12   6456  0.288266      5
13    564  0.917817      5

Upvotes: 3

Views: 3243

Answers (4)

tdy

Reputation: 41327

You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():

bins = 5
splits = np.array_split(df, bins)

for i in range(len(splits)):
    splits[i]['group'] = i + 1

df = pd.concat(splits)

Or as a one-liner using assign():

df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])

        x         y  group
0       1  0.145781      1
1       2  0.262097      1
2       3  0.114799      1
3       4  0.275054      2
4       5  0.841606      2
5       6  0.187210      2
6       7  0.582487      3
7       8  0.019881      3
8       5  0.847115      3
9      45  0.755606      4
10  64545  0.196705      4
11     65  0.688639      4
12   6456  0.275884      5
13    564  0.579946      5
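A side note on the loop version: np.array_split returns views of the original frame, so assigning the new column in place can trigger SettingWithCopyWarning. Copying each split first avoids that (a minimal sketch on toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(14), 'y': range(14)})

bins = 5
# copy each split: np.array_split returns views of df, and writing a
# new column into a view can raise SettingWithCopyWarning
splits = [s.copy() for s in np.array_split(df, bins)]
for i, s in enumerate(splits):
    s['group'] = i + 1
df = pd.concat(splits)

print(df['group'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
```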

Upvotes: 2

Tom

Reputation: 8790

Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:

bins = 5

l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1

df['group'] = group

Result:

        x         y  group
0       1  0.205168      1
1       2  0.105466      1
2       3  0.545794      1
3       4  0.639346      2
4       5  0.758056      2
5       6  0.982090      2
6       7  0.942849      3
7       8  0.284520      3
8       5  0.491151      3
9      45  0.731265      4
10  64545  0.072668      4
11     65  0.601416      4
12   6456  0.239454      5
13    564  0.345006      5
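For len(df) == 14 and bins == 5, the intermediate arrays work out like this (a quick trace reusing the names above):

```python
import numpy as np

bins = 5
l = 14                        # len(df)
minbinlen = l // bins         # 2 rows minimum per bin
remainder = l % bins          # 4 leftover rows
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1      # earlier bins absorb the leftovers
print(repeats.tolist())       # [3, 3, 3, 3, 2]

group = np.repeat(range(bins), repeats) + 1
print(group.tolist())         # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
```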

This seems to follow the splitting logic of np.array_split (i.e. split as evenly as possible, adding the extra rows onto the earlier bins when the count doesn't divide evenly).

While the code is less concise, it doesn't use any loops, so in theory it should be faster with larger amounts of data.

Just because I was curious, I'm going to leave this perfplot test here:

[perfplot benchmark: runtime vs. input size for the four approaches]

import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)

    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1

    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)

    for i in range(len(splits)):
        splits[i]['group'] = i + 1

    return pd.concat(splits)

perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index))//3,axis=0).ngroup()+1,
        lambda df: array_split(df)
        ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)

Upvotes: 1

Henry Ecker

Reputation: 35626

Another option is to build the list of group labels from near_split:

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)

Output:

        x         y  group
0       1  0.313614      1
1       2  0.765079      1
2       3  0.153851      1
3       4  0.792098      2
4       5  0.123700      2
5       6  0.239107      2
6       7  0.133665      3
7       8  0.979318      3
8       5  0.781948      3
9      45  0.264344      4
10  64545  0.495561      4
11     65  0.504734      4
12   6456  0.766627      5
13    564  0.428423      5
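As a quick sanity check for the 14-row / 5-bin case, near_split returns the per-bin counts, which sum back to the row count:

```python
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

print(near_split(14, 5))       # [3, 3, 3, 3, 2]
print(sum(near_split(14, 5)))  # 14
```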

Upvotes: 2

wwnde

Reputation: 26676

Group every N rows (here N = 3) and label each chunk with ngroup:

df['group'] = df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1



        x         y  group
0       1  0.548801      1
1       2  0.096620      1
2       3  0.713771      1
3       4  0.922987      2
4       5  0.283689      2
5       6  0.807755      2
6       7  0.592864      3
7       8  0.670315      3
8       5  0.034549      3
9      45  0.355274      4
10  64545  0.239373      4
11     65  0.156208      4
12   6456  0.419990      5
13    564  0.248278      5
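The 3 in the groupby is the rows-per-group for this particular 14-row / 5-bin case. If you want to derive it from a bin count instead, one hedged generalization (assuming ceiling division gives the chunk size you want) is:

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(14), 'y': range(14)})

bins = 5
size = math.ceil(len(df) / bins)  # rows per group: 3 here
df['group'] = df.groupby(np.arange(len(df)) // size).ngroup() + 1
print(df['group'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
```

Note that unlike np.array_split, this puts the entire shortfall into the last group, so for some row counts the final bin can be noticeably smaller than the others.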

Upvotes: 3
