Debadri Dutta
Debadri Dutta

Reputation: 1193

pd.merge_asof() based on Time-Difference not merging all values - Pandas

I have two dataframes, one with news and the other with stock price. Both the dataframes have a "Date" column. I want to merge them on a gap of 5 days.

Lets say my news dataframe is df1 and the other price dataframe as df2.

My df1 looks like this:

News_Dates             News
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30     ABC Corp. suffers a loss
2018-10-01     ABC Corp to Sell stakes
2018-12-20     We are going to comeback strong said ABC CEO
2018-12-22     Shares are down massively for ABC Corp.

My df2 looks like this:

  Dates             Price
2018-10-04           120
2018-12-24           131

First method of merging I do is:

pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']), left_on=['Dates'], right_on=['News_Dates'] \
              tolerance=pd.Timedelta('5d'), direction='backward')

The resulting df is:

  Dates       News_Dates   News                                     Price
2018-10-04    2018-10-01  ABC Corp to Sell stakes                    120
2018-12-24    2018-12-22  Shares are down massively for ABC Corp.    131

The second way of merging I do is:

pd.merge_asof(df_n.sort_values(by=['Dates']), df1_zscore.sort_values(by=['Dates']), left_on=['News_Dates'], right_no=['Dates'] \
              tolerance=pd.Timedelta('5d'), direction='forward').dropna()

And the resulting df as:

News_Dates            News                                                Dates      Price
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case    2018-10-04    120
2018-09-30     ABC Corp. suffers a loss                                 2018-10-04    120 
2018-10-01     ABC Corp to Sell stakes                                  2018-10-04    120
2018-12-22     Shares are down massively for ABC Corp.                  2018-12-24    131

Both the merging results in separate dfs, however there are values in both the cases which are missing, like for second case for 4th October price, news from 29th, 30th Sept should have been merged. And in case 2 for 24th December price 20th December should also have been merged.

So I'm not quite able to figure out where am I going wrong.

P.S. My objective is to merge the price df with the news df that have come in the last 5 days from the price date.

Upvotes: 0

Views: 3351

Answers (2)

Dev Khadka
Dev Khadka

Reputation: 5451

Here is my solution using numpy

df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'), ('2018-09-30', 'ABC Corp. suffers a loss'), ('2018-10-01', 'ABC Corp to Sell stakes'), ('2018-12-20', 'We are going to comeback strong said ABC CEO'), ('2018-12-22', 'Shares are down massively for ABC Corp.')], columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))

df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values

## substract each pair of n_dates and p_dates and create a matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')

## get matrix of boolean for which difference is between 0 and 5 day
## to be used as index for original array
comparision =  (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(n_dates)*len(p_dates))[comparision.ravel()]


## calculate row and column index from cell number to index the df
pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True), 
           df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)

Result

Dates   Price   News_Dates  News
0   2018-10-04  120 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t...
1   2018-10-04  120 2018-09-30  ABC Corp. suffers a loss
2   2018-10-04  120 2018-10-01  ABC Corp to Sell stakes
3   2018-12-24  131 2018-12-20  We are going to comeback strong said ABC CEO
4   2018-12-24  131 2018-12-22  Shares are down massively for ABC Corp.

Upvotes: 0

manwithfewneeds
manwithfewneeds

Reputation: 1167

You can swap the left and right dataframe:

df = pd.merge_asof(
        df1,
        df2,
        left_on='News_Dates',
        right_on='Dates',
        tolerance=pd.Timedelta('5D'),
        direction='nearest'
    )

df = df[['Dates', 'News_Dates', 'News', 'Price']]
print(df)

        Dates News_Dates                                               News Price
0 2018-10-04 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t... 120
1 2018-10-04 2018-09-30                           ABC Corp. suffers a loss 120
2 2018-10-04 2018-10-01                            ABC Corp to Sell stakes 120
3 2018-12-24 2018-12-20       We are going to comeback strong said ABC CEO 131
4 2018-12-24 2018-12-22            Shares are down massively for ABC Corp. 131

Upvotes: 2

Related Questions