How do I assign a group # to a set of rows in a pandas data frame?

Question

A dataframe has a time column with int values that start at zero. I want to split my data frame into 100 groups (for example), where the step is ts = df['time'].max()/100. One naive way to do it is to test whether each value of the 'time' column is greater than t and less than t+ts, where t runs over an np.linspace vector that starts at 0 and ends at df['time'].max().

Here is what my data frame looks like:

df.head()
   0  1  2           3      time
0  1  1  1  1130165891  59559371
1  2  1  1  1158784502  88177982
2  2  1  1  1158838664  88232144
3  2  1  1  1158838931  88232411
4  2  1  1  1158839132  88232612

user2285236 · Accepted Answer

You can use pd.cut to generate the groups:

df.groupby(pd.cut(df['time'], 2)).mean()
Out: 
                            0  1  2           3      time
time                                                     
(59530697.759, 73895991.5]  1  1  1  1130165891  59559371
(73895991.5, 88232612]      2  1  1  1158825307  88218787

This example has only 2 groups because the dataset is very small; you can change the number of groups. When you pass a number, the bins span the observed range of the column, so they start at its minimum rather than at zero. Instead of passing the number of groups, you can also pass the break points (with or without np.linspace).

df.groupby(pd.cut(df['time'], [0, 6*10**7, np.inf], include_lowest=True)).mean()
Out: 
                 0  1  2           3      time
time                                          
[0, 60000000]    1  1  1  1130165891  59559371
(60000000, inf]  2  1  1  1158825307  88218787

I took the mean in both examples to show you how it works. You can use a different method on the groupby object.
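
To get an actual group number per row, as the question asks, one option is to combine the break-point form of pd.cut with labels=False. The following is a minimal sketch, not a definitive implementation: the sample frame is rebuilt from the df.head() output in the question, and the names n_groups, edges, group and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's df.head()
df = pd.DataFrame({
    0: [1, 2, 2, 2, 2],
    1: [1, 1, 1, 1, 1],
    2: [1, 1, 1, 1, 1],
    3: [1130165891, 1158784502, 1158838664, 1158838931, 1158839132],
    "time": [59559371, 88177982, 88232144, 88232411, 88232612],
})

n_groups = 100  # number of groups asked for in the question (illustrative name)

# n_groups + 1 evenly spaced edges from 0 to the maximum time
edges = np.linspace(0, df["time"].max(), n_groups + 1)

# labels=False returns the integer bin index (0 .. n_groups - 1) instead of an
# Interval; include_lowest=True keeps a time of exactly 0 in the first bin
df["group"] = pd.cut(df["time"], bins=edges, labels=False, include_lowest=True)

# Any groupby method can then be applied, e.g. the row count per group
counts = df.groupby("group").size()
```

On the five sample rows, the first row falls in group 67 and the other four in group 99. Because the edges start at 0 rather than at the column minimum, the group numbers match the step ts = df['time'].max()/100 described in the question.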
