Reputation: 901
I have two data frames.
Data Frame #1 (df1):
1 object_name time_epoch_ms source data
2 a 1538518822490 source_1 some_data
3 b 1538528822490 source_2 some_data
4 b 1538538822490 source_2 some_data
5 b 1538548822490 source_3 some_data
6 b 1538558822490 source_1 some_data
7 c 1538568822490 source_2 some_data
8 c 1538578822490 source_2 some_data
9 c 1538588822490 source_2 some_data
... etc etc
Data Frame #2 (df2):
1 object_name time_epoch_ms new_data
2 a 1538518722490 x
3 b 1538528822490 y
4 b 1538518922490 z
5 b 1538519922490 a
6 b 1538598822490 b
7 c 1538548822490 c
8 c 1538538822490 c
9 c 1538528822490 d
... etc etc
Entries in these two tables give information about objects, identified by object_name, at different points in time.
I would like to combine the two tables so that each row of Data Frame #2 is augmented with the source that object_name was using at that time, according to Data Frame #1.
Problem: timestamps in the two tables do not match exactly. Some timestamps that exist in Data Frame #2 do not exist in Data Frame #1, although they should roughly correspond to within the hour. Some timestamps in Data Frame #1 are missing, so falling back to the "last available data" is best.
Is there a way to accomplish the steps below using pandas' merge functionality?
My approach:
I currently use .apply to take each row of df2, look up all the timestamps for that object in df1 (if any exist), and return the closest match or null.
I am wondering whether there is a more elegant way to do this with pandas' merge or concat functionality, but I am having trouble understanding how to apply them in this case, and how to handle filling in data and matching to the hour (without separate pre-processing to add an hour column).
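For reference, the row-wise version looks roughly like the sketch below (closest_source is just an illustrative helper name, not my exact code):
import numpy as np

def closest_source(row):
    # All df1 rows for the same object_name
    candidates = df1[df1["object_name"] == row["object_name"]]
    if candidates.empty:
        return np.nan
    # Pick the df1 row whose timestamp is closest to this df2 row's timestamp
    diffs = (candidates["time_epoch_ms"] - row["time_epoch_ms"]).abs()
    return candidates.loc[diffs.idxmin(), "source"]

df2["source"] = df2.apply(closest_source, axis=1)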
Upvotes: 1
Views: 52
Reputation: 16194
In general Pandas makes "equi-joins" easy, but other kinds of joins are more awkward. In this case you're lucky: there's a nice method called merge_asof that should do what you need.
It's a bit pedantic about how your data is set up, but an MWE is:
from io import StringIO
import pandas as pd
df1 = pd.read_table(StringIO("""1 object_name time_epoch_ms source data
2 a 1538518822490 source_1 some_data_1
3 b 1538528822490 source_2 some_data_2
4 b 1538538822490 source_2 some_data_3
5 b 1538548822490 source_3 some_data_4
6 b 1538558822490 source_1 some_data_5
7 c 1538568822490 source_2 some_data_6
8 c 1538578822490 source_2 some_data_7
9 c 1538588822490 source_2 some_data_8
"""), sep=r"\s+", index_col=0)
df2 = pd.read_table(StringIO("""1 object_name time_epoch_ms new_data
2 a 1538518722490 x
3 b 1538528822490 y
4 b 1538518922490 z
5 b 1538519922490 a
6 b 1538598822490 b
7 c 1538548822490 c
8 c 1538538822490 c
9 c 1538528822490 d
"""), sep=r"\s+", index_col=0)
pd.merge_asof(
    df2.sort_values(['time_epoch_ms', 'object_name']),  # both frames must be sorted on the 'on' key
    df1.sort_values(['time_epoch_ms', 'object_name']),
    by="object_name",     # only match rows for the same object
    on="time_epoch_ms",   # then take the nearest timestamp...
    direction='forward',  # ...at or after each df2 timestamp
).sort_values(['object_name', 'time_epoch_ms'])
which gives back:
object_name time_epoch_ms new_data source data
0 a 1538518722490 x source_1 some_data_1
1 b 1538518922490 z source_2 some_data_2
2 b 1538519922490 a source_2 some_data_2
3 b 1538528822490 y source_2 some_data_2
7 b 1538598822490 b NaN NaN
4 c 1538528822490 d source_2 some_data_6
5 c 1538538822490 c source_2 some_data_6
6 c 1538548822490 c source_2 some_data_6
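If you also want to enforce the "roughly within the hour" requirement, or prefer the last earlier reading instead of the next later one, merge_asof takes tolerance and direction arguments. Since time_epoch_ms is a plain integer column here, the tolerance is just a number of milliseconds (a sketch on the same frames as above, not a definitive recipe):
pd.merge_asof(
    df2.sort_values(['time_epoch_ms', 'object_name']),
    df1.sort_values(['time_epoch_ms', 'object_name']),
    by="object_name", on="time_epoch_ms",
    direction='backward',       # take the last df1 reading at or before each df2 timestamp
    tolerance=60 * 60 * 1000,   # but only if it is within one hour (milliseconds)
).sort_values(['object_name', 'time_epoch_ms'])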
See Pandas equivalent of SQL non-equi JOIN for another example. There's also merge_ordered, but I don't think that helps in your case.
Upvotes: 2