Reputation: 163
Let's say we have the following dataset:
Name Num
First 1
Second 50
Third 110
Fourth 2
Fifth 58
Sixth 105
Seventh 8
I want to group the dataframe based on a maximum gap of (for example) 10 between each row of the num column. So, for example, I would like to have:
Name Num
First,Fourth,Seventh 1,2,8
Second,Fifth 50,58
Sixth,Third 105,110
I tried with Raymond Hettinger's Answer, but I need to group the dataframe, not only the column, because I want to do a barplot after. I tried with this (I sorted the dataframe based on the column, first):
for idx,val in enumerate(df_sorted['num'], start=1):
if idx == 1:
groups =[df_sorted.iloc[idx]]
elif abs(data['pos'][idx] - data['pos'][idx-1]) <= 10:
groups[-1].append(data.iloc[idx,:])
else:
groups.append(data.iloc[idx,:])
But I'm having some problem understanding how lists work in Python (I always used them in R). Don't know if there is an easier way using groupby or something else (like Grouper for time series).
EDIT: At the very end, I choose the following approach to solve the problem, using Raymond Hettinger's Answer:
def cluster(data, maxgap, column):
'''Arrange data into groups where successive elements
differ by no more than *maxgap*
'''
#data.sort() #sort data if necessary
groups = list()
groups = [[data.iloc[0,:]]]
for idx,val in enumerate(data[column]):
if idx > 0:
if abs(val - data[column][idx-1]) <= maxgap:
groups[-1].append(data.iloc[idx,:])
else:
groups.append([data.iloc[idx,:]])
return groups
Upvotes: 0
Views: 443
Reputation: 35135
We binned the numbers to calculate the distribution and then grouped and listed them according to that distribution; N=24 is intentionally tailored to your expected results. Thresholds may be more difficult to set for actual operations. For example, for every 10 categories, 50 and 58 would be a different group.
import pandas as pd
import numpy as np
import io
data = '''
Name Num
First 1
Second 50
Third 110
Fourth 2
Fifth 58
Sixth 105
Seventh 8
'''
df = pd.read_csv(io.StringIO(data), sep='\s+')
df.sort_values('Num', ascending=True, inplace=True, ignore_index=True)
N = 24
category = pd.cut(df['Num'], bins=np.arange(df['Num'].min(), df['Num'].max()+ N, N), right=False)
category.value_counts()
[1, 25) 3
[97, 121) 2
[49, 73) 2
[73, 97) 0
[25, 49) 0
Name: Num, dtype: int64
df.groupby(category).agg(list)
Name Num
Num
[1, 25) [First, Fourth, Seventh] [1, 2, 8]
[25, 49) [] []
[49, 73) [Second, Fifth] [50, 58]
[73, 97) [] []
[97, 121) [Sixth, Third] [105, 110]
Upvotes: 1