melon_lord

Reputation: 101

Group rows by a certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) with this structure:

id  datetime             indicator  other_values ...
1   2020-01-14 00:12:00  0          ...
1   2020-01-17 00:23:00  1          ...
  ... 
1   2021-02-01 00:00:00  0          ...
2   2020-01-15 00:05:00  0          ...
2   2020-03-10 00:07:00  0          ...
  ... 
2   2021-05-22 00:00:00  1          ...
  ... 

There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is no data for every day, but there can be multiple entries on the same day).

Now, for each row where indicator == 1, I want to collect every row with the same id whose datetime is at most 10 days earlier. All other rows that are not in range of any indicator can be dropped. Ideally I want the result saved as a dataset of time series, each of which will later be fed into a neural network. (There can be more than one indicator == 1 case per id; the other values should be preserved.)

An example for one id: I want to convert this

id  datetime             indicator  other_values ...
1   2020-01-14 00:12:00  0          ...
1   2020-01-17 00:23:00  1          ...
1   2020-01-17 00:13:00  0          ...
1   2020-01-20 00:05:00  0          ...
1   2020-03-10 00:07:00  0          ...
1   2020-05-19 00:00:00  0          ...
1   2020-05-20 00:00:00  1          ...

into this

id  datetime             group      other_values ...
1   2020-01-14 00:12:00  A          ...
1   2020-01-17 00:23:00  A          ...
1   2020-01-17 00:13:00  A          ...
1   2020-05-19 00:00:00  B          ...
1   2020-05-20 00:00:00  B          ...

or grouped in some similar way into groups A, B, ... .

A naive Python for loop is not an option, since it takes ages on a dataset like this. There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.

Upvotes: 1

Views: 70

Answers (2)

Stu Sztukowski

Reputation: 12909

Here is a way to do it with pd.merge_asof(). Let's create our data:

import pandas as pd

data = {'id': [1,1,1,1,1,1,1],
        'datetime': ['2020-01-14 00:12:00', 
                    '2020-01-17 00:23:00',
                    '2020-01-17 00:13:00',
                    '2020-01-20 00:05:00',
                    '2020-03-10 00:07:00',
                    '2020-05-19 00:00:00',
                    '2020-05-20 00:00:00'],
        'ind': [0,1,0,0,0,0,1]
       }

df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])  # astype('datetime64') is unit-less and errors on recent pandas

Data:

   id            datetime  ind
0   1 2020-01-14 00:12:00    0
1   1 2020-01-17 00:23:00    1
2   1 2020-01-17 00:13:00    0
3   1 2020-01-20 00:05:00    0
4   1 2020-03-10 00:07:00    0
5   1 2020-05-19 00:00:00    0
6   1 2020-05-20 00:00:00    1

Next, let's add a date to the dataset and pull all dates where the indicator is 1.

df['date'] = df['datetime'].dt.normalize()  # strip the time but keep datetime64 dtype
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)

Which gives us this:

df:

   id            datetime  ind       date
0   1 2020-01-14 00:12:00    0 2020-01-14
1   1 2020-01-17 00:23:00    1 2020-01-17
2   1 2020-01-17 00:13:00    0 2020-01-17
3   1 2020-01-20 00:05:00    0 2020-01-20
4   1 2020-03-10 00:07:00    0 2020-03-10
5   1 2020-05-19 00:00:00    0 2020-05-19
6   1 2020-05-20 00:00:00    1 2020-05-20

df2:

   id       date  ind2
1   1 2020-01-17     1
6   1 2020-05-20     1

Now let's join them using pd.merge_asof() with direction='forward' and a tolerance of 10 days. This matches each row to the nearest indicator date on or after it, up to 10 days ahead.

df = pd.merge_asof(df.drop('ind', axis=1), df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')

Which gives us this:

   id            datetime  ind       date  ind2
0   1 2020-01-14 00:12:00    0 2020-01-14   1.0
1   1 2020-01-17 00:23:00    1 2020-01-17   1.0
2   1 2020-01-17 00:13:00    0 2020-01-17   1.0
3   1 2020-01-20 00:05:00    0 2020-01-20   NaN
4   1 2020-03-10 00:07:00    0 2020-03-10   NaN
5   1 2020-05-19 00:00:00    0 2020-05-19   1.0
6   1 2020-05-20 00:00:00    1 2020-05-20   1.0
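
One caveat: pd.merge_asof() requires both inputs to be sorted by the on key, even when by='id' is used; the single-id example happens to satisfy that already. On the full dataset you would sort both frames by date before the merge and restore the id/datetime order afterwards, roughly like this:

# merge_asof needs a global sort on the `on` column ('date'), not on id
df = df.sort_values('date')
df2 = df2.sort_values('date')
# ... pd.merge_asof(...) as above ...
df = df.sort_values(['id', 'datetime'])  # the shift-based grouping below assumes this order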

Next, let's work on creating groups. A new group starts at a row whenever any of three conditions holds:

  1. The previous value of ind2 is NaN (the previous row is outside any window)
  2. The previous row's id differs from the current id (we're at the first row of a new id)
  3. The current date is more than 10 days after the previous row's date

Restricting these conditions to rows where ind2 is present gives a Boolean marking group starts, which we can cumulatively sum to create our group ids.

df['group_id'] = (  df['ind2'].notna()
                  & (  df['ind2'].shift().isna()   # NaN never compares equal, so `== np.NaN` is always False
                     | (df['id'].shift() != df['id'])
                     | (df['date'] - df['date'].shift() > pd.Timedelta('10d')))
                 ).cumsum()

    id            datetime  ind       date  ind2  group_id
0    1 2020-01-14 00:12:00    0 2020-01-14   1.0         1
1    1 2020-01-17 00:23:00    1 2020-01-17   1.0         1
2    1 2020-01-17 00:13:00    0 2020-01-17   1.0         1
3    1 2020-01-20 00:05:00    0 2020-01-20   NaN         1
4    1 2020-03-10 00:07:00    0 2020-03-10   NaN         1
5    1 2020-05-19 00:00:00    0 2020-05-19   1.0         2
6    1 2020-05-20 00:00:00    1 2020-05-20   1.0         2

Now we drop all the rows where ind2 is NaN, remove the helper columns date and ind2, and we're done.

df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)

Final output:

   id            datetime  ind  group_id
0   1 2020-01-14 00:12:00    0         1
1   1 2020-01-17 00:23:00    1         1
2   1 2020-01-17 00:13:00    0         1
5   1 2020-05-19 00:00:00    0         2
6   1 2020-05-20 00:00:00    1         2
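
If you then want each window as its own time series for the neural network, one way is to split on group_id (a sketch; the name windows is just illustrative):

# one DataFrame per 10-day window, ready to feed downstream
windows = [g.drop(columns='group_id').reset_index(drop=True)
           for _, g in df.groupby('group_id')]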

Upvotes: 3

grey_ranger

Reputation: 1030

I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:

import pandas as pd
import datetime
import uuid

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)

timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp = grp.copy()
    grp['Group'] = None
    for time in grp[grp.indicator == 1]['datetime']:
        # use .loc rather than chained indexing, which triggers
        # SettingWithCopyWarning and silently fails under copy-on-write
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])

df.groupby('id').apply(consolidate)
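
Depending on your pandas version, .apply() here may prepend id as an extra index level on the result; passing group_keys=False keeps the original row index:

df.groupby('id', group_keys=False).apply(consolidate)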

If there are multiple rows with indicator == 1 in an id group, the for loop applies them in index order, so a later window can overwrite an earlier one where they overlap. If you can be certain that there is exactly one indicator == 1 in each group, we can simplify the consolidate function:

def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    # .copy() so the column assignment below doesn't hit a view of df
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp
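
Note that this version assumes every id group really does contain an indicator == 1 row; otherwise .iloc[0] raises an IndexError. If that is not guaranteed, a guarded sketch:

def consolidate(grp):
    hits = grp.loc[grp.indicator == 1, 'datetime']
    if hits.empty:
        return grp.iloc[0:0]  # no indicator in this id: drop the whole group
    grp = grp[grp['datetime'].between(hits.iloc[0] - timedelta, hits.iloc[0])].copy()
    grp['Group'] = uuid.uuid4()
    return grp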

Upvotes: 2
