Find closest preceding value in different columns of Pandas DataFrame

Question

I am trying to find a way to find a matching value, given a specific column value, in the nearest preceding rows of two separate columns of a Pandas Dataframe, and subsequently indicate '1' if found in the column else '0'.

The Dataframe index is not sorted.

Data:

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})

What I need:

When the type == 'Trade' I need to look back through the bid_price and ask_price, and find the first value that matches the column price. In the same row as the one with the trade I want two separate columns indicating whether the price was found in the nearest bid_price or ask_price columns.

Expected Output:

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021034', '2020-11-16 01:39:06.22021033'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5],
  'is_bid_trade': [0, 0, 0, 0, 0, 1],
  'is_ask_trade': [1, 1, 0, 0, 0, 0]})

You can see that the first trade matches the quote in the preceding row in the ask_price column. The final trade matches in the bid_price column, but this is two rows behind the trade.

I have tried (and have been kindly helped by SO) but have yet to find a solution here.

The datetime column is sadly not 100% accurate, so cannot be relied upon to sort chronologically. I have also attempted to find the minimum index using df.index.get_loc(), but am unsure of how to apply this to two columns to search within.

All help very gratefully received.

Jonathan Leon · Accepted Answer

Here ya go. Note, in your input dataset I changed a string 'NaN' to np.nan to be consistent, and I think your output dataset had a misplaced 1. It's inconsistent as to whether the 1 should go where the trade occurred or on the preceding row. nonetheless, i think this works the way you want with data provided. See comments in the code. If the 1s are supposed to be at the trade row, you can modify the indexing to get the right row.

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': [np.NaN, 7026.5, 7026.5, np.NaN, np.NaN, 7024.0],
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5],
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})
# you don't have to sort, but reset the index
df.reset_index(drop=True, inplace=True)

# collect the indices where Trade occurred
trade_indices = df.loc[df['type'] == 'Trade'].index.tolist()
# collect corresponding trade price
prices = df['price'].loc[df['price'].notnull()].tolist()
# create a tuple to match the trade row and price
test_tuples = list(zip(trade_indices, prices))
print(test_tuples)
dfo = df # create an output dataframe leaving input df as-is
dfo[['is_bid_trade', 'is_ask_trade']] = 0 # create your new columns with zeroes

# iterate over tuples; this will take full range from 0 up to the row the trade occurred; look for price in either ask or bid price columns, then take the last row (tail(1)). 
# tail(1) will be your most recent row to the trade
for (tradei, price) in test_tuples:
    print(tradei, price)
    # print(df[0:tradei][(df[0:tradei][['ask_price', 'bid_price']] == price).any(axis=1)])
    # print(df[0:tradei][(df[0:tradei][['ask_price', 'bid_price']] == price).any(axis=1)].tail(1))
    dftemp = df[0:tradei][(df[0:tradei][['ask_price', 'bid_price']] == price).any(axis=1)].tail(1)
    # print(dftemp)
    if dftemp.iat[0,3] == price:
        # test if in ask or bid then write to dfo
        dfindex = dftemp.index[0]
        #dfo.at[dfindex, 'is_ask_trade'] = 1
        dfo.at[tradei, 'is_ask_trade'] = 1
    else:
        dfindex = dftemp.index[0]
        #dfo.at[dfindex, 'is_bid_trade'] = 1
        dfo.at[tradei, 'is_ask_trade'] = 1

Output:

In [4]: dfo
Out[4]:
datetime                     type  price  ask_price bid_price is_bid_trade is_ask_trade
2020-11-16 01:39:06.22021017 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021020 Trade 7026.5 7026.5    7024.5    0    1
2020-11-16 01:39:06.22021022 Trade 7026.5 7026.0    7024.5    0    1
2020-11-16 01:39:06.22021031 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021033 Quote NaN    7026.0    7024.5    0    0
2020-11-16 01:39:06.22021036 Trade 7024.0 7026.5    7024.5    0    1

Find closest preceding value in different columns of Pandas DataFrame

Answers (1)

Related Questions