jab
jab

Reputation: 5823

Preserving Left/Right Join Logic in Pandas Conditional Join

Pandas currently doesn't support a conditional join in the SQL sense; however, one can be emulated (with lesser performance) by performing a larger join on common fields and then applying a filter on the post-join processing.

However, I'm looking for a way to preserve left (or right) elements through this conditional join. I lose them through the post-processing.

import pandas

# Create a dataframe
df = pandas.DataFrame([{'name': 'A', 'start': '20171201', 'end': '20180205'}, {'name': 'A', 'start': '20170901', 'end': '20170905'}, {'name': 'B', 'start': '20190101', 'end': '20190205'}])
df['start'] = pandas.to_datetime(df['start'])
df['end'] = pandas.to_datetime(df['end'])
print df


             end name      start
    0 2018-02-05    A 2017-12-01
    1 2017-09-05    A 2017-09-01
    2 2019-02-05    B 2019-01-01



# Create another dataframe, don't want to lose any data here.
v_df = pandas.DataFrame([{'name': 'A', 'val': 10, 'date': '20180101'}, {'name': 'B', 'val': 20, 'date': '20170101'}])
v_df['date'] = pandas.to_datetime(v_df['date'])
print v_df

            date name  val
    0 2018-01-01    A   10
    1 2017-01-01    B   20



# Conditional Left Join both dataframes, want to avoid losing the name B val.
v_df = v_df.merge(df, how='left', on=['name'])
v_df = v_df[v_df['date'].between(v_df['start'], v_df['end'])]
print v_df


            date name  val        end      start
    0 2018-01-01    A   10 2018-02-05 2017-12-01

Desired output in this case would be the following, which includes a left record from B.

        date name  val        end      start
0 2018-01-01    A   10 2018-02-05 2017-12-01
2 2017-01-01    B   20        NaT        NaT

A solution would also need to be able to handle when multiple records match in the general merge, but when none of that group match the condition, a null record is returned (rather than no record at all).

Upvotes: 1

Views: 497

Answers (2)

EFT
EFT

Reputation: 2369

filter_df = df.merge(v_df)
filter_df = filter_df[filter_df['date'].between(filter_df['start'], filter_df['end'])]
v_df.merge(filter_df, how='left')

should return what you want. The filter_df identifies what rows a join properly works on, which you do in the question, and the second merge attaches these start and end values to only the relevant rows.

Edit:

@MaxU's answer is functionally equivalent and syntactically more elegant.

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

I'm not sure this is the best approach, but it seems to do the job:

In [191]: v_df.merge(v_df.merge(df).query("start <= date <= end"), how='left')
Out[191]:
        date name  val        end      start
0 2018-01-01    A   10 2018-02-05 2017-12-01
1 2017-01-01    B   20        NaT        NaT

Upvotes: 2

Related Questions