Grouping dataframe based on a gap between column values

Question

Let's say we have the following dataset:

Name    Num

First    1
Second   50
Third    110
Fourth   2
Fifth    58
Sixth    105
Seventh  8

I want to group the dataframe based on a maximum gap of (for example) 10 between each row of the num column. So, for example, I would like to have:

Name                   Num
First,Fourth,Seventh   1,2,8
Second,Fifth           50,58
Sixth,Third            105,110

I tried with Raymond Hettinger's Answer, but I need to group the dataframe, not only the column, because I want to do a barplot after. I tried with this (I sorted the dataframe based on the column, first):

    for idx,val in enumerate(df_sorted['num'], start=1):
    if idx == 1:
        groups =[df_sorted.iloc[idx]]
    elif abs(data['pos'][idx] - data['pos'][idx-1]) <= 10:
        groups[-1].append(data.iloc[idx,:])
    else:
        groups.append(data.iloc[idx,:])

But I'm having some problem understanding how lists work in Python (I always used them in R). Don't know if there is an easier way using groupby or something else (like Grouper for time series).

EDIT: At the very end, I choose the following approach to solve the problem, using Raymond Hettinger's Answer:

def cluster(data, maxgap, column):
'''Arrange data into groups where successive elements
   differ by no more than *maxgap*
'''
#data.sort() #sort data if necessary
groups = list()
groups = [[data.iloc[0,:]]]

for idx,val in enumerate(data[column]):
    if idx > 0:
        if abs(val - data[column][idx-1]) <= maxgap:
            groups[-1].append(data.iloc[idx,:])
        else:
            groups.append([data.iloc[idx,:]])
return groups

r-beginners · Accepted Answer

We binned the numbers to calculate the distribution and then grouped and listed them according to that distribution; N=24 is intentionally tailored to your expected results. Thresholds may be more difficult to set for actual operations. For example, for every 10 categories, 50 and 58 would be a different group.

import pandas as pd
import numpy as np
import io

data = '''
Name Num
First 1
Second 50
Third 110
Fourth 2
Fifth 58
Sixth 105
Seventh 8
'''

df = pd.read_csv(io.StringIO(data), sep='\s+')

df.sort_values('Num', ascending=True, inplace=True, ignore_index=True)
N = 24
category = pd.cut(df['Num'], bins=np.arange(df['Num'].min(), df['Num'].max()+ N, N), right=False)

category.value_counts()
[1, 25)      3
[97, 121)    2
[49, 73)     2
[73, 97)     0
[25, 49)     0
Name: Num, dtype: int64

df.groupby(category).agg(list)
    Name    Num
Num     
[1, 25) [First, Fourth, Seventh]    [1, 2, 8]
[25, 49)    []  []
[49, 73)    [Second, Fifth] [50, 58]
[73, 97)    []  []
[97, 121)   [Sixth, Third]  [105, 110]

Grouping dataframe based on a gap between column values

Answers (1)

Related Questions