lowjumpingfrog

Reputation: 33

Joining data frames with datetime windows

I am looking to match two dataframes: the times in one dataframe should fall into the time windows of the other.

Production Dataframe

Production Time      Product  Value  Worker_ID
2020-01-24 08:13:59  Prod4    5.9    402
2020-01-24 08:15:38  Prod5    5.7    402
2020-01-24 08:17:17  Prod4    5.1    402
2020-01-25 22:13:59  Prod4    5.9    402
2020-01-25 21:15:38  Prod7    5.7    402
2020-01-26 02:17:17  Prod2    5.1    402
2020-01-24 09:17:17  Prod4    5.1    403
2020-01-25 21:13:59  Prod5    5.9    403

Location Dataframe

Location  window_start         window_stop          Worker_ID
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403

Results would look like this:

Location  window_start         window_stop          Worker_ID  Production Time      Product  Value
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:13:59  Prod4    5.9
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:15:38  Prod5    5.7
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 22:13:59  Prod4    5.9
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 21:15:38  Prod7    5.7
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-26 02:17:17  Prod2    5.1
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403        2020-01-24 09:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403        2020-01-25 21:13:59  Prod5    5.9

The match is made first on Worker_ID, then on whether the production datetime falls within the location's datetime window.

This code works:

# Pair every production row with every location row for the same worker,
# then keep the pairs whose production time falls inside the window
possible_matches = location_df.merge(production_df, on='Worker_ID', how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) &
                            (possible_matches['Production Time'] <= possible_matches['window_stop'])]

But it does not work when there are millions of rows in the production dataframe and thousands of rows in the location dataframe: it runs out of memory.
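
That is expected: merge(on='Worker_ID') first materializes every production/location pairing per worker, and only then does the window filter prune rows. A rough back-of-envelope with illustrative counts (these numbers are assumptions, not from the question):

# Illustrative sizes: 5M production rows, 10k location rows, 400 workers,
# with rows spread evenly across workers
n_prod, n_loc, n_workers = 5_000_000, 10_000, 400

# The merge materializes every same-worker pairing before filtering
pairs = n_workers * (n_prod / n_workers) * (n_loc / n_workers)
print(f"{pairs:,.0f}")  # 125,000,000 intermediate rows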

I am looking for a more efficient way of doing this join, one that actually scales to large datasets with more workers and locations.

Upvotes: 0

Views: 467

Answers (1)

Corralien

Reputation: 120391

To avoid a crash, you may have to check the datetimes before merging.

To test this at scale, I generated two dataframes: 10,000 location records (df2) and 5,000,000 production records (df1).

import numpy as np
import pandas as pd

# Hourly timestamps covering 2020 ('closed' became 'inclusive' in pandas 2.0)
dti = pd.date_range('2020-01-01', '2021-01-01', freq='H', closed='left')

df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
# Note: np.random.randint returns a single scalar here, so every window
# shares the same randomly drawn length
df2['window_stop'] = df2['window_start'] + pd.DateOffset(hours=np.random.randint(4, 17))

df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
# Random production times (seconds -> nanoseconds) spanning the full window range
df1['Production Time'] = pd.to_datetime(1e9 * np.random.randint(df2['window_start'].min().timestamp(), df2['window_stop'].max().timestamp(), len(df1)))
>>> df1
         Worker_ID     Production Time
0              263 2020-12-31 11:28:31
1              194 2020-09-19 04:57:17
2              139 2020-06-14 00:27:07
3              105 2020-04-14 02:45:05
4              484 2020-12-07 22:36:56
...            ...                 ...
4999995        338 2020-05-29 18:30:39
4999996        455 2020-03-03 20:51:27
4999997        228 2020-12-19 01:43:12
4999998        197 2020-04-07 07:32:13
4999999        304 2020-07-06 14:51:39

[5000000 rows x 2 columns]

>>> df2
      Worker_ID        window_start         window_stop
0           309 2020-10-07 18:00:00 2020-10-08 08:00:00
1           486 2020-01-24 19:00:00 2020-01-25 09:00:00
2           120 2020-11-05 10:00:00 2020-11-06 00:00:00
3           224 2020-04-08 15:00:00 2020-04-09 05:00:00
4           208 2020-01-08 23:00:00 2020-01-09 13:00:00
...         ...                 ...                 ...
9995        218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996        358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997        474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998        416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999        443 2020-03-31 09:00:00 2020-03-31 23:00:00

[10000 rows x 3 columns]
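
The matching itself then runs per worker: only production rows and windows that share a Worker_ID are compared, so each boolean match matrix stays small.
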
# from tqdm import tqdm

# Convert datetimes to arrays of int64 (nanoseconds since epoch) so the
# window comparisons run on plain NumPy arrays
ptime = df1['Production Time'].astype('int64').values
wtime = df2[['window_start', 'window_stop']].astype('int64').values

data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    # Restrict both frames to the current worker
    i = df1.loc[df1['Worker_ID'] == wid]
    j = df2.loc[df2['Worker_ID'] == wid]

    # For each production time p (enumerated as x), mark this worker's
    # windows that contain p; -1 marks "no match"
    m = [np.where((wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1]), x, -1)
             for x, p in enumerate(ptime[i.index])]
    # Row/column indices of the matches: m[0] -> production rows, m[1] -> windows
    m = np.where(np.array(m) >= 0)

    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')

    data.append(df)
df = pd.concat(data)
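
For reference, the list comprehension above can be replaced by a single broadcast comparison per worker. This is a minimal sketch under the same assumptions as the loop (default RangeIndex on both frames, the question's column names); match_windows is a name introduced here, not part of the original answer:

import numpy as np
import pandas as pd

def match_windows(location_df, production_df):
    # Assumes both frames have a default RangeIndex
    ptime = production_df['Production Time'].astype('int64').to_numpy()
    wtime = location_df[['window_start', 'window_stop']].astype('int64').to_numpy()

    data = []
    for wid in location_df['Worker_ID'].unique():
        i = production_df.loc[production_df['Worker_ID'] == wid]
        j = location_df.loc[location_df['Worker_ID'] == wid]

        # mask[x, y] is True when production row x falls inside window y
        p = ptime[i.index][:, None]                      # shape (len(i), 1)
        mask = (wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1])

        rows, cols = np.nonzero(mask)
        data.append(pd.concat([j.iloc[cols].reset_index(drop=True),
                               i.iloc[rows].reset_index(drop=True)],
                              axis='columns'))
    return pd.concat(data, ignore_index=True)

# result = match_windows(location_df, production_df)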

Old answer: create an IntervalIndex to bind each production time to its corresponding window, then merge on Worker_ID and the interval:

# dfl is the question's location dataframe, dfp the production dataframe
ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # 'left' means >= and <

dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
      .drop(columns='interval')

  Location        window_start         window_stop  Worker_ID     Production Time Product  Value
0    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:13:59   Prod4    5.9
1    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:15:38   Prod5    5.7
2    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:17:17   Prod4    5.1
3    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 22:13:59   Prod4    5.9
4    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 21:15:38   Prod7    5.7
5    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-26 02:17:17   Prod2    5.1
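
Note that pd.cut rejects overlapping bins, so the old answer only works while no two windows overlap. Since the IntervalIndex is built from all workers' windows at once, two workers sharing the same shift (as the two identical Loc27 windows in the question's data do) already breaks it. A minimal illustration:

import pandas as pd

ii = pd.IntervalIndex.from_tuples([(1, 5), (3, 8)], closed='left')
try:
    pd.cut([4], bins=ii)
except ValueError as e:
    print(e)  # pandas refuses to cut with an overlapping IntervalIndex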

Upvotes: 1
