I'm looking to match two dataframes, using times in one dataframe that fall into the time windows of the other.
Production Dataframe
Production Time | Product | Value | Worker_ID |
---|---|---|---|
2020-01-24 08:13:59 | Prod4 | 5.9 | 402 |
2020-01-24 08:15:38 | Prod5 | 5.7 | 402 |
2020-01-24 08:17:17 | Prod4 | 5.1 | 402 |
2020-01-25 22:13:59 | Prod4 | 5.9 | 402 |
2020-01-25 21:15:38 | Prod7 | 5.7 | 402 |
2020-01-26 02:17:17 | Prod2 | 5.1 | 402 |
2020-01-24 09:17:17 | Prod4 | 5.1 | 403 |
2020-01-25 21:13:59 | Prod5 | 5.9 | 403 |
Location Dataframe
Location | window_start | window_stop | Worker_ID |
---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 |
Results would look like this:
Location | window_start | window_stop | Worker_ID | Production Time | Product | Value |
---|---|---|---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:13:59 | Prod4 | 5.9 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:15:38 | Prod5 | 5.7 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 22:13:59 | Prod4 | 5.9 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 21:15:38 | Prod7 | 5.7 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-26 02:17:17 | Prod2 | 5.1 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 | 2020-01-24 09:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 | 2020-01-25 21:13:59 | Prod5 | 5.9 |
The match is made first on Worker_ID, then on whether the production datetime falls within the datetime window of the location.
This code works:
possible_matches = location_df.merge(production_df,on='Worker_ID',how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) &
(possible_matches['Production Time'] <= possible_matches['window_stop'])]
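As a self-contained check (rebuilding the sample frames above, with the value column named Value), the merge-then-filter approach does produce the 8 expected rows:

```python
import pandas as pd

location_df = pd.DataFrame({
    'Location': ['Loc16', 'Loc27', 'Loc61', 'Loc27'],
    'window_start': pd.to_datetime(['2020-01-24 05:00:00', '2020-01-25 21:00:00',
                                    '2020-01-24 05:00:00', '2020-01-25 21:00:00']),
    'window_stop': pd.to_datetime(['2020-01-24 21:00:00', '2020-01-26 05:00:00',
                                   '2020-01-24 21:00:00', '2020-01-26 05:00:00']),
    'Worker_ID': [402, 402, 403, 403],
})
production_df = pd.DataFrame({
    'Production Time': pd.to_datetime(['2020-01-24 08:13:59', '2020-01-24 08:15:38',
                                       '2020-01-24 08:17:17', '2020-01-25 22:13:59',
                                       '2020-01-25 21:15:38', '2020-01-26 02:17:17',
                                       '2020-01-24 09:17:17', '2020-01-25 21:13:59']),
    'Product': ['Prod4', 'Prod5', 'Prod4', 'Prod4', 'Prod7', 'Prod2', 'Prod4', 'Prod5'],
    'Value': [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    'Worker_ID': [402, 402, 402, 402, 402, 402, 403, 403],
})

# Cartesian product per Worker_ID, then keep rows inside the window
possible_matches = location_df.merge(production_df, on='Worker_ID', how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start'])
                            & (possible_matches['Production Time'] <= possible_matches['window_stop'])]
print(len(build_df))  # 8 matched rows
```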
But it does not work (it runs out of memory) when there are millions of rows in the production dataframe and thousands of rows in the location dataframe, because the intermediate merge on Worker_ID alone explodes in size.
I'm looking for a more efficient way of doing this join that actually works on large datasets with more workers and locations.
Answer:
To avoid a crash, you may have to check the datetimes per worker before merging.
I generated two dataframes: 10,000 records for location and 5,000,000 for production.
import numpy as np
import pandas as pd

# Hourly timestamps spanning all of 2020
dti = pd.date_range('2020-01-01', '2021-01-01', freq='H', closed='left')  # pandas >= 1.4: inclusive='left'
# Location dataframe: 10,000 random windows
df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
# np.random.randint returns a single scalar here, so every window gets the same length
df2['window_stop'] = df2['window_start'] + pd.DateOffset(hours=np.random.randint(4, 17))
# Production dataframe: 5,000,000 timestamps drawn within the overall window range
df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
df1['Production Time'] = pd.to_datetime(1e9 * np.random.randint(df2['window_start'].min().timestamp(), df2['window_stop'].max().timestamp(), len(df1)))
>>> df1
Worker_ID Production Time
0 263 2020-12-31 11:28:31
1 194 2020-09-19 04:57:17
2 139 2020-06-14 00:27:07
3 105 2020-04-14 02:45:05
4 484 2020-12-07 22:36:56
... ... ...
4999995 338 2020-05-29 18:30:39
4999996 455 2020-03-03 20:51:27
4999997 228 2020-12-19 01:43:12
4999998 197 2020-04-07 07:32:13
4999999 304 2020-07-06 14:51:39
[5000000 rows x 2 columns]
>>> df2
Worker_ID window_start window_stop
0 309 2020-10-07 18:00:00 2020-10-08 08:00:00
1 486 2020-01-24 19:00:00 2020-01-25 09:00:00
2 120 2020-11-05 10:00:00 2020-11-06 00:00:00
3 224 2020-04-08 15:00:00 2020-04-09 05:00:00
4 208 2020-01-08 23:00:00 2020-01-09 13:00:00
... ... ... ...
9995 218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996 358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997 474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998 416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999 443 2020-03-31 09:00:00 2020-03-31 23:00:00
[10000 rows x 3 columns]
# from tqdm import tqdm  # optional progress bar
# Convert datetimes to int64 (nanosecond) arrays for fast comparisons
ptime = df1['Production Time'].astype('int64').values
wtime = df2[['window_start', 'window_stop']].astype('int64').values
data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    # Rows belonging to this worker in each dataframe
    i = df1.loc[df1['Worker_ID'] == wid]
    j = df2.loc[df2['Worker_ID'] == wid]
    # For each production time p, mark each matching window with p's index x, else -1
    m = [np.where((wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1]), x, -1)
         for x, p in enumerate(ptime[i.index])]
    # m[0]: positions of matching production rows; m[1]: positions of their windows
    m = np.where(np.array(m) >= 0)
    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')
    data.append(df)
df = pd.concat(data)
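A related middle ground (my sketch, not benchmarked here) is to keep the question's merge-then-filter but run it per worker via groupby, so each intermediate cartesian product only covers one worker's rows. On the question's sample data it returns the 8 expected rows:

```python
import pandas as pd

location_df = pd.DataFrame({
    'Location': ['Loc16', 'Loc27', 'Loc61', 'Loc27'],
    'window_start': pd.to_datetime(['2020-01-24 05:00:00', '2020-01-25 21:00:00',
                                    '2020-01-24 05:00:00', '2020-01-25 21:00:00']),
    'window_stop': pd.to_datetime(['2020-01-24 21:00:00', '2020-01-26 05:00:00',
                                   '2020-01-24 21:00:00', '2020-01-26 05:00:00']),
    'Worker_ID': [402, 402, 403, 403],
})
production_df = pd.DataFrame({
    'Production Time': pd.to_datetime(['2020-01-24 08:13:59', '2020-01-24 08:15:38',
                                       '2020-01-24 08:17:17', '2020-01-25 22:13:59',
                                       '2020-01-25 21:15:38', '2020-01-26 02:17:17',
                                       '2020-01-24 09:17:17', '2020-01-25 21:13:59']),
    'Product': ['Prod4', 'Prod5', 'Prod4', 'Prod4', 'Prod7', 'Prod2', 'Prod4', 'Prod5'],
    'Value': [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    'Worker_ID': [402, 402, 402, 402, 402, 402, 403, 403],
})

def window_join(location_df, production_df):
    # Merge then filter inside each Worker_ID group, so the cartesian
    # product never spans more than one worker's rows at a time.
    parts = []
    for wid, prod in production_df.groupby('Worker_ID'):
        loc = location_df[location_df['Worker_ID'] == wid]
        merged = loc.merge(prod, on='Worker_ID')
        parts.append(merged[(merged['Production Time'] >= merged['window_start'])
                            & (merged['Production Time'] <= merged['window_stop'])])
    return pd.concat(parts, ignore_index=True)

result = window_join(location_df, production_df)
print(len(result))  # 8 matched rows
```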
Old answer: create an interval index to bind each production time to the corresponding window, then merge on Worker_ID and the interval (dfl is the location dataframe, dfp the production one). Note that pd.cut requires the bins to be non-overlapping, so identical windows shared by several workers must be deduplicated or handled per worker:
# One interval per location window
ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # left means >= and <
# Tag each production time with the window it falls into
dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
.drop(columns='interval')
Location window_start window_stop Worker_ID Production Time Product Value
0 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:13:59 Prod4 5.9
1 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:15:38 Prod5 5.7
2 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:17:17 Prod4 5.1
3 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-25 22:13:59 Prod4 5.9
4 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-25 21:15:38 Prod7 5.7
5 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-26 02:17:17 Prod2 5.1
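Because of the non-overlapping-bins constraint, a minimal runnable version of the interval approach restricted to worker 402's two windows (and building the key with pd.cut on both sides so the merge keys share one dtype, a small deviation from the snippet above) reproduces the six rows shown:

```python
import pandas as pd

# Worker 402's two windows and six production rows from the sample
dfl = pd.DataFrame({
    'Location': ['Loc16', 'Loc27'],
    'window_start': pd.to_datetime(['2020-01-24 05:00:00', '2020-01-25 21:00:00']),
    'window_stop': pd.to_datetime(['2020-01-24 21:00:00', '2020-01-26 05:00:00']),
    'Worker_ID': [402, 402],
})
dfp = pd.DataFrame({
    'Production Time': pd.to_datetime(['2020-01-24 08:13:59', '2020-01-24 08:15:38',
                                       '2020-01-24 08:17:17', '2020-01-25 22:13:59',
                                       '2020-01-25 21:15:38', '2020-01-26 02:17:17']),
    'Product': ['Prod4', 'Prod5', 'Prod4', 'Prod4', 'Prod7', 'Prod2'],
    'Value': [5.9, 5.7, 5.1, 5.9, 5.7, 5.1],
    'Worker_ID': [402] * 6,
})

# One left-closed interval per window (start <= t < stop)
ii = pd.IntervalIndex.from_arrays(dfl['window_start'], dfl['window_stop'], closed='left')
# Cut both key columns with the same bins so they get identical categories
dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = pd.cut(dfl['window_start'], bins=ii)  # each start falls in its own window

res = (pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left')
         .drop(columns='interval'))
print(len(res))  # 6 rows, three per window
```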