Reputation: 12411
Each row of my dataframe represents an interval, defined by date1 and date2, plus a user id. For each user id, I need to group together the intervals that are separated by a gap below a certain threshold.
So far, for each user id, I sort the rows by begin and end date, compute the gaps between consecutive rows, and assign rows to groups based on those values. Finally, I append the modified rows to a new dataframe (the only way I found to un-group the dataframe).
However, this is quite slow. Do you see ways to improve how I do the grouping?
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

def gap(group):
    # Gap between each row's earliest date and the previous row's latest date
    return group[['date1', 'date2']].min(axis=1) - \
           group.shift()[['date1', 'date2']].max(axis=1)
def cluster(df, threshold):
    df['clusters'] = 0
    grouped = df.groupby('user_id')
    newdf = pd.DataFrame()
    for name, group in grouped:
        group = group.sort_values(['date1', 'date2'], ascending=True)
        group['gap'] = gap(group)
        cuts = group['gap'] > timedelta(threshold)
        df2 = group.copy()
        # Bump the cluster id of every row at or after each cut point
        for g, d, r in zip(group.loc[cuts, 'gap'], group.loc[cuts, 'date1'], group.loc[cuts, 'date2']):
            df2.loc[(df2['date1'] >= d) & (df2['date2'] >= r), 'clusters'] += 1
        df2 = df2.drop('gap', axis=1)
        newdf = pd.concat([newdf, df2])
    return newdf
Here is a minimal sample of the data it uses:
df = pd.DataFrame(dict([
    ('user_id', np.array(['a', 'a', 'a', 'a', 'a', 'a', 'a'])),
    ('date1', np.array([datetime.strptime(x, "%y%m%d") for x in
                        ['160101', '160103', '160110', '160120', '160130', '160308', '160325']])),
    ('date2', np.array([datetime.strptime(x, "%y%m%d") for x in
                        ['160107', '160109', '160115', '160126', '160206', '160314', '160402']])),
]))
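As a rough check, assuming the threshold is a number of days (it is passed straight to timedelta), calling cluster on this sample with a 10-day threshold should put the first five intervals in one cluster, since only the last two gaps (31 and 11 days) exceed it:
result = cluster(df, 10)  # 10 is an assumed example threshold, in days
print(result['clusters'].tolist())
# Expected: [0, 0, 0, 0, 0, 1, 2]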
Upvotes: 0
Views: 533
Reputation: 16251
A simple improvement would be to use cumsum on the boolean vector cuts:
def cluster2(df, threshold):
    df['clusters'] = 0
    grouped = df.groupby('user_id')
    df_list = []
    for name, group in grouped:
        group = group.sort_values(['date1', 'date2'], ascending=True)
        group['gap'] = gap(group)
        cuts = group['gap'] > timedelta(threshold)
        df2 = group.copy()
        # Each True in cuts starts a new cluster; cumsum yields the labels
        df2['clusters'] = cuts.cumsum()
        df_list.append(df2)
    return pd.concat(df_list)
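This works because each True in cuts marks the first row of a new cluster, and a cumulative sum over the booleans turns those flags into increasing integer labels. A minimal sketch with made-up flag values:
import pandas as pd
cuts = pd.Series([False, False, True, False, True])
print(cuts.cumsum().tolist())  # [0, 0, 1, 1, 2]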
Edit: following OP's comment, I moved concatenation out of the loop to improve performance.
A further improvement could be to not sort the groups in the groupby operation (if there are many users):
grouped = df.groupby('user_id', sort=False)
Or even group manually, by sorting df by user_id and then adding a condition to cuts directly on the original dataframe:
df = df.sort_values(['user_id', 'date1', 'date2'], ascending=True)
df['gap'] = gap(df)
# A new cluster starts at each user change or at each large-enough gap
cuts = (df['user_id'] != df['user_id'].shift()) | (df['gap'] > timedelta(threshold))
df['clusters'] = cuts.cumsum()
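Run on the sample data with the same assumed 10-day threshold, this should give the labels below; they start at 1 rather than 0 because the first row always registers as a user change, but the cluster boundaries match the cluster2 output:
print(df['clusters'].tolist())  # [1, 1, 1, 1, 1, 2, 3]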
Upvotes: 2