Eric D. Brown D.Sc.
Eric D. Brown D.Sc.

Reputation: 1956

Comparing Pandas Dataframe Rows & Dropping rows with overlapping dates

I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that trade isn't taken if the strategy is already in a trade - but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.

Here's my problem for the data I have: I need to do a row-by-row comparison of the dataframe to determine if Entrydate of rowX is less than ExitDate rowX-1.

A sample of my data:

Row 1:
EntryDate  ExitDate
2012-07-25 2012-07-27 

Row 2:
EntryDate  ExitDate
2012-07-26 2012-07-29

Row 2 needs to be deleted because it is a trade that should not have occurred.

I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.

for i in range(0,len(df)+1):
    if i+1 == len(df): break #to keep from going past last row
    ExitDate = df['ExitDate'].irow(i)
    EntryNextTrade = df['EntryDate'].irow(i+1)

    if EntryNextTrade>ExitDate: 
        line={'EntryDate':EntryDate,'ExitDate':ExitDate}
        df_trades=df_trades.append(line,ignore_index=True)

Any thoughts or ideas on how to more efficiently accomplish this?

You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.

Upvotes: 7

Views: 4039

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375535

You should use some kind of boolean mask to do this kind of operation.

One way is to create a dummy column for the next trade:

df['EntryNextTrade'] = df['EntryDate'].shift()

Use this to create the mask:

msk = df['EntryNextTrade'] > df'[ExitDate']

And use loc to look at the subDataFrame where msk is True, and only the specified columns:

df.loc[msk, ['EntryDate', 'ExitDate']]

Upvotes: 11

Related Questions