nscricco
nscricco

Reputation: 105

Pandas merge with logic

I would like to merge two dataframes, but can't exactly figure out how to do so without iterating. Basically, I want to merge a row from df2 to df1 if df1.date >= df2.start_date and df1.date <= df2.end_date. See example below:

df1:
index   date         value
0       2012-08-01   82
1       2012-08-02   20
2       2012-08-03   94
...
n-1     2012-10-29   58
n       2012-10-30   73

df2:
index   start_date   end_date     other_value
0       2012-08-01   2012-09-04   'foo'
1       2012-09-05   2012-10-15   'bar'
2       2012-10-16   2012-11-01   'foobar'
...


final_df:
index   df2_index   date         value  other_value
0       0           2012-08-01   82     'foo'
1       0           2012-08-02   20     'foo'
2       0           2012-08-03   94     'foo'
...
n-1     2           2012-10-29   58     'foobar'
n       2           2012-10-30   73     'foobar'

I thought about creating a date series vector to merge with df2, so that I can combine on date, but it seems very manual and does not leverage the power/speed of pandas. I also thought about trying to expand df2 into single days, but couldn't find any way to do so without a manual / iteration type solution.

Upvotes: 10

Views: 3621

Answers (1)

unutbu
unutbu

Reputation: 880449

The naive iterative approach is O(n*m), where n = len(df1) and m = len(df2), since for each date in df1 you would have to check its inclusion in up to m intervals.

If the intervals defined by df2 are disjoint, then there is a theoretically better way: use searchsorted to find where each date in df1 fits amongst the start_dates, and then use searchsorted a second time to find where each date fits amongst the end_dates. When the index from the two calls to searchsorted are equal, the date falls inside an interval.

Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).

If m is large enough, using searchsorted should be faster than the naive iterative approach.

If m is not large, the iterative approach may be faster.


Here is an example, using searchsorted:

import numpy as np
import pandas as pd
Timestamp = pd.Timestamp
df1 = pd.DataFrame({'date': (Timestamp('2012-08-01'),
                             Timestamp('2012-08-02'),
                             Timestamp('2012-08-03'),
                             Timestamp('2012-10-29'),
                             Timestamp('2012-10-30'),
                             Timestamp('2012-11-01'),
                             Timestamp('2012-10-15'),  # on then end_date
                             Timestamp('2012-09-04'),  # outside an interval
                             Timestamp('2012-09-05'),  # on then start_date
                             ),
                    'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})

print(df1)
df2 = pd.DataFrame({'end_date': (
                        Timestamp('2012-10-15'),
                        Timestamp('2012-09-04'),
                        Timestamp('2012-11-01')),
                    'other_value': ("foo", "bar", "foobar"),
                    'start_date': (
                        Timestamp('2012-09-05'),
                        Timestamp('2012-08-01'),
                        Timestamp('2012-10-16'))})
df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])
df2.sort(['start_date'], inplace=True)
print(df2)

# Convert to DatetimeIndexes so we can call the searchsorted method
date_idx = pd.DatetimeIndex(df1['date'])
start_date_idx = pd.DatetimeIndex(df2['start_date'])
# Add one to the end_date so the original end_date will be included in the
# half-open interval.
end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)

start_idx = start_date_idx.searchsorted(date_idx, side='right')-1
end_idx = end_date_idx.searchsorted(date_idx, side='right')
df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)

result = pd.merge(df1, df2, left_on=['idx'], right_index=True)
result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])
print(result)

With df1 equal to

        date  value
0 2012-08-01     82
1 2012-08-02     20
2 2012-08-03     94
3 2012-10-29     58
4 2012-10-30     73
5 2012-11-01      1
6 2012-10-15      2
7 2012-09-04      3
8 2012-09-05      4

and df2 equal to

  start_date   end_date other_value
1 2012-08-01 2012-09-04         bar
0 2012-09-05 2012-10-15         foo
2 2012-10-16 2012-11-01      foobar

the above code yields

   idx       date  value other_value
0    0 2012-08-01     82         foo
1    0 2012-08-02     20         foo
2    0 2012-08-03     94         foo
7    0 2012-09-04      3         foo
3    2 2012-10-29     58      foobar
4    2 2012-10-30     73      foobar
5    2 2012-11-01      1      foobar
6    1 2012-10-15      2         bar
8    1 2012-09-05      4         bar

Upvotes: 10

Related Questions