Reputation: 1495
I'm trying to make a column in a dataframe depicting the group or bin each observation belongs to. The idea is to sort the dataframe by some column, then add another column denoting which bin each observation falls into. If I want deciles, I should be able to tell a function that I want 10 equal (or close to equal) groups.
I tried the pandas qcut but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, etc. Take the following for example:
import numpy as np
import pandas as pd
x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))
df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)
This gives a df of 14 observations. How could I split them into 5 (nearly) equal groups?
The desired result would be the following:
x y group
0 1 0.926273 1
1 2 0.678101 1
2 3 0.636875 1
3 4 0.802590 2
4 5 0.494553 2
5 6 0.874876 2
6 7 0.607902 3
7 8 0.028737 3
8 5 0.493545 3
9 45 0.498140 4
10 64545 0.938377 4
11 65 0.613015 4
12 6456 0.288266 5
13 564 0.917817 5
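(For reference, pd.qcut can return integer codes instead of interval tuples via labels=False — though note that it bins by value quantiles, not by position, so it won't reproduce the positional groups shown above:)

```python
import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x, 'y': np.random.rand(len(x))})

# labels=False yields integer bin codes (0-based) rather than Interval objects
df['value_group'] = pd.qcut(df['x'], 5, labels=False) + 1
```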
Upvotes: 3
Views: 3243
Reputation: 41327
You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():
bins = 5
splits = np.array_split(df, bins)
for i in range(len(splits)):
    splits[i]['group'] = i + 1
df = pd.concat(splits)
Or as a one-liner using assign():
df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])
x y group
0 1 0.145781 1
1 2 0.262097 1
2 3 0.114799 1
3 4 0.275054 2
4 5 0.841606 2
5 6 0.187210 2
6 7 0.582487 3
7 8 0.019881 3
8 5 0.847115 3
9 45 0.755606 4
10 64545 0.196705 4
11 65 0.688639 4
12 6456 0.275884 5
13 564 0.579946 5
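A note of caution: pandas may emit a SettingWithCopyWarning when assigning into the pieces returned by np.array_split, since they can be views of the original frame. Copying each piece first sidesteps this (a minor variation on the loop above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(14), 'y': range(14)})
bins = 5

pieces = []
for i, piece in enumerate(np.array_split(df, bins)):
    piece = piece.copy()      # work on a copy, not a potential view of df
    piece['group'] = i + 1
    pieces.append(piece)
df = pd.concat(pieces)
```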
Upvotes: 2
Reputation: 8790
Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:
bins = 5
l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1
df['group'] = group
Result:
x y group
0 1 0.205168 1
1 2 0.105466 1
2 3 0.545794 1
3 4 0.639346 2
4 5 0.758056 2
5 6 0.982090 2
6 7 0.942849 3
7 8 0.284520 3
8 5 0.491151 3
9 45 0.731265 4
10 64545 0.072668 4
11 65 0.601416 4
12 6456 0.239454 5
13 564 0.345006 5
This seems to follow the splitting logic of np.array_split (i.e. it tries to split evenly, but adds extra rows to the earlier bins when an even split isn't possible). While the code is less concise, it doesn't use any loops, so it should theoretically be faster with larger amounts of data.
Just because I was curious, I'm going to leave this perfplot test here...
import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)
    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1
    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)
    for i in range(len(splits)):
        splits[i]['group'] = i + 1
    return pd.concat(splits)

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index)) // 3, axis=0).ngroup() + 1,
        lambda df: array_split(df),
    ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)
Upvotes: 1
Reputation: 35626
Another option, generating a list of indexes from near_split:
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)
bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)
Output:
x y group
0 1 0.313614 1
1 2 0.765079 1
2 3 0.153851 1
3 4 0.792098 2
4 5 0.123700 2
5 6 0.239107 2
6 7 0.133665 3
7 8 0.979318 3
8 5 0.781948 3
9 45 0.264344 4
10 64545 0.495561 4
11 65 0.504734 4
12 6456 0.766627 5
13 564 0.428423 5
Upvotes: 2
Reputation: 26676
Group by N rows, and find the ngroup:
df['group'] = df.groupby(np.arange(len(df.index)) // 3, axis=0).ngroup() + 1
x y group
0 1 0.548801 1
1 2 0.096620 1
2 3 0.713771 1
3 4 0.922987 2
4 5 0.283689 2
5 6 0.807755 2
6 7 0.592864 3
7 8 0.670315 3
8 5 0.034549 3
9 45 0.355274 4
10 64545 0.239373 4
11 65 0.156208 4
12 6456 0.419990 5
13 564 0.248278 5
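The // 3 above hard-codes the per-group row count for this particular frame (14 rows, 5 groups). As a sketch of a more general version, the chunk size can be derived from the desired number of bins with a ceiling division, then fed to the same ngroup trick:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(14)})
bins = 5
n = -(-len(df) // bins)  # ceiling division: rows per group
df['group'] = df.groupby(np.arange(len(df)) // n).ngroup() + 1
```

With this approach any shortfall lands in the last group, which matches the output above for 14 rows and 5 bins.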
Upvotes: 3