RebeccaK375
RebeccaK375

Reputation: 901

Join Data Frames Using Closest Hour and Fill in Missing Time Stamps for Particular Objects

I have two data frames.

Data Frame #1 (df1):

1    object_name     time_epoch_ms      source      data
2    a               1538518822490      source_1     some_data
3    b               1538528822490      source_2     some_data
4    b               1538538822490      source_2     some_data
5    b               1538548822490      source_3     some_data
6    b               1538558822490      source_1     some_data
7    c               1538568822490      source_2     some_data
8    c               1538578822490      source_2     some_data
9    c               1538588822490      source_2     some_data
... etc etc

Data Frame #2(df2):

1    object_name     time_epoch_ms      new_data
2    a               1538518722490      x
3    b               1538528822490      y
4    b               1538518922490      z
5    b               1538519922490      a
6    b               1538598822490      b
7    c               1538548822490      c
8    c               1538538822490      c
9    c               1538528822490      d
... etc etc

Entries in these two tables give information about objects with object_names for different points in time.

I would like to combine these two tables such that Data Frame #2 is augmented with the source the object_name was using at a particular time specified in Data Frame #1.

Problem: Timestamps in the two tables do not exactly match. Some timestamps that exist in table 2 do not exist in table 1. However, they should roughly correspond to the hour. Some timestamps in Table 1 are missing, so the "last available data" is best.

Is there a way to execute the following steps using pandas' merge ability?

  1. Match data from dataframes such that df2 gets "source" data from df1 based on object_name and time_stamp such that time_stamp matches to the closest hour.
  2. If data for a particular hour is in df2, but not df1, then source is retried from the last available hour of data, so it is "filled" based on whatever data exists for that object.
  3. If object in df2 is not in df1, then it gets a "null"

My approach:

I currently do .apply to get each row from df2 and find all the timestamps for that object in df1, if they exist. Then, I return the closest match or null.

I am wondering if there is a more elegant way to do this using pandas' merge or concat functionality, but I am having trouble understanding how to use them in this case and how to handle filling data in and matching based to the hour (without doing separate pre-processing to get an hour column in there).

Upvotes: 1

Views: 52

Answers (1)

Sam Mason
Sam Mason

Reputation: 16194

In general Pandas makes "equi-joins" easy, but other sorts are kind of difficult. In this case you're lucky, as there's a nice method called merge_asof that should do what you need.

It's a bit pedantic about how your data is set up, but a MWE is:

from io import StringIO
import pandas as pd

df1 = pd.read_table(StringIO("""1    object_name     time_epoch_ms      source      data
2    a               1538518822490      source_1     some_data_1
3    b               1538528822490      source_2     some_data_2
4    b               1538538822490      source_2     some_data_3
5    b               1538548822490      source_3     some_data_4
6    b               1538558822490      source_1     some_data_5
7    c               1538568822490      source_2     some_data_6
8    c               1538578822490      source_2     some_data_7
9    c               1538588822490      source_2     some_data_8
"""), sep=r"\s+", index_col=0)

df2 = pd.read_table(StringIO("""1    object_name     time_epoch_ms      new_data
2    a               1538518722490      x
3    b               1538528822490      y
4    b               1538518922490      z
5    b               1538519922490      a
6    b               1538598822490      b
7    c               1538548822490      c
8    c               1538538822490      c
9    c               1538528822490      d
"""), sep=r"\s+", index_col=0)

pd.merge_asof(
    df2.sort_values(['time_epoch_ms', 'object_name']),
    df1.sort_values(['time_epoch_ms', 'object_name']),
    by="object_name", on="time_epoch_ms",
    direction='forward',
).sort_values(['object_name', 'time_epoch_ms'])

which gives back:

  object_name  time_epoch_ms new_data    source         data
0           a  1538518722490        x  source_1  some_data_1
1           b  1538518922490        z  source_2  some_data_2
2           b  1538519922490        a  source_2  some_data_2
3           b  1538528822490        y  source_2  some_data_2
7           b  1538598822490        b       NaN          NaN
4           c  1538528822490        d  source_2  some_data_6
5           c  1538538822490        c  source_2  some_data_6
6           c  1538548822490        c  source_2  some_data_6

See Pandas equivalent of SQL non-equi JOIN for another example. There's also merge_ordered but I don't think that helps your case.

Upvotes: 2

Related Questions