Luke
Luke

Reputation: 7099

Pandas joining based on date

I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with the a date just before that of the left dataframe. Probably easiest to show with an example.

df1:

group     date      teacher
  a     1/10/00        1
  a     2/27/00        1
  b     1/7/00         1
  b     4/5/00         1
  c     2/9/00         2
  c     9/12/00        2

df2:

teacher    date    hair length
   1       1/1/00       4
   1       1/5/00       8
   1       1/30/00     20
   1       3/20/00    100
   2       1/1/00       0
   2       8/10/00     50

Gives us:

group     date      teacher    hair length
  a     1/10/00        1           8
  a     2/27/00        1          20
  b     1/7/00         1           8
  b     4/5/00         1         100
  c     2/9/00         2           0
  c     9/12/00        2          50

Edit 1: Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow, surely there must be a better way.

Upvotes: 1

Views: 318

Answers (2)

Luke
Luke

Reputation: 7099

Seems like the quickest way to do this is using sqlite via pysqldf:

def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):

    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError, e:
        raise(e, 'Need to pass in both a group and date key for both tables')

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                    FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                         MAX(tableb.{date_b}) AS tdate
                        FROM tablea
                        JOIN tableb
                        ON tablea.{group_a}=tableb.{group_b}
                        AND tablea.{date_a}>=tableb.{date_b}
                        GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                    JOIN tableb b
                    ON   a.{group_a}=b.{group_b}
                    AND  a.tdate=b.{date_b};
                    """.format(group_a=tablea_group, date_a=tablea_date, 
                               group_b=tableb_group, date_b=tableb_date,
                               temp_date='join_date', base_id=base_id)
    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])

Upvotes: 0

ely
ely

Reputation: 77504

One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:

df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())

then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.

This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:

# Assuming df1 and df2 are sorted by the dates

df1['hair length'] = 0 # initialize

r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)

for i, l_row in df1.iterrows():
    cur_hair_length = 0 # Assume 0 works when df1 has a date earlier than df2

    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break

    df1.loc[i, 'hair length'] = cur_hair_length

Upvotes: 1

Related Questions