Reputation: 7099
I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the record from the right dataframe whose date most recently precedes the left dataframe's date. Probably easiest to show with an example.
df1:
group  date     teacher
a      1/10/00  1
a      2/27/00  1
b      1/7/00   1
b      4/5/00   1
c      2/9/00   2
c      9/12/00  2
df2:
teacher  date     hair length
1        1/1/00   4
1        1/5/00   8
1        1/30/00  20
1        3/20/00  100
2        1/1/00   0
2        8/10/00  50
Gives us:
group  date     teacher  hair length
a      1/10/00  1        8
a      2/27/00  1        20
b      1/7/00   1        8
b      4/5/00   1        100
c      2/9/00   2        0
c      9/12/00  2        50
Edit 1: Hacked together a way to do this: basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
Upvotes: 1
Views: 318
Reputation: 7099
Seems like the quickest way to do this is using SQLite via pandasql's sqldf:
from pandasql import sqldf
import pandas as pd

def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys,
                           base_id='group'):
    # base_id names the grouping column for the inner GROUP BY;
    # the original pulled it from the enclosing scope
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError:
        raise ValueError('Need to pass in both a group and date key '
                         'for both tables')
    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a},
                                  tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)
    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + list(tablea_keys),
                    right_on=['group', tableb_group, 'join_date'])
Upvotes: 0
Reputation: 77504
One way to do this is to create a new column in the left data frame which, for a given row's date, holds the closest earlier date from the right data frame:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle null values or other corner cases.
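A minimal runnable sketch of this mapping approach, using only teacher 1's rows transcribed from the question (the one-line map above doesn't account for the teacher grouping, so the single-teacher case keeps the example honest):

```python
import pandas as pd

# Teacher 1's rows from the question's df1 and df2
df1 = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                    'date': pd.to_datetime(['1/10/00', '2/27/00',
                                            '1/7/00', '4/5/00'])})
df2 = pd.DataFrame({'date': pd.to_datetime(['1/1/00', '1/5/00',
                                            '1/30/00', '3/20/00']),
                    'hair length': [4, 8, 20, 100]})

# For each left-hand date, find the latest right-hand date not after it
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())

# A left merge then pulls in the matching 'hair length' values
merged = pd.merge(df1, df2, how='left',
                  left_on='join_date', right_on='date',
                  suffixes=('', '_right'))
print(merged[['group', 'date', 'hair length']])
```

This reproduces the question's expected values for teacher 1 (8, 20, 8, 100).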
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right-hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
# Note: this walks dates only; apply it per teacher group for the
# question's data
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
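For completeness, newer pandas versions (0.19+) ship pd.merge_asof, which performs exactly this backward-looking join, including the per-teacher matching via by=; both frames must be sorted by the join date. A sketch using the data transcribed from the question:

```python
import pandas as pd

# df1 and df2 transcribed from the question
df1 = pd.DataFrame({'group': list('aabbcc'),
                    'date': pd.to_datetime(['1/10/00', '2/27/00', '1/7/00',
                                            '4/5/00', '2/9/00', '9/12/00']),
                    'teacher': [1, 1, 1, 1, 2, 2]})
df2 = pd.DataFrame({'teacher': [1, 1, 1, 1, 2, 2],
                    'date': pd.to_datetime(['1/1/00', '1/5/00', '1/30/00',
                                            '3/20/00', '1/1/00', '8/10/00']),
                    'hair length': [4, 8, 20, 100, 0, 50]})

# merge_asof requires both frames sorted by the 'on' column;
# direction='backward' (the default) picks the latest df2 date <= each df1 date
result = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')
print(result.sort_values(['group', 'date'])[['group', 'date', 'hair length']])
```

Sorting the result back by group and date reproduces the question's expected table.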
Upvotes: 1