Eric D. Brown D.Sc.

Reputation: 1956

Removing 'overlapping' dates from pandas dataframe

I have a pandas dataframe that looks like the following:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
6   11/15/09   146.73
7   07/03/11   171.10

I want to remove any rows that overlap.

Overlapping rows are defined as any row within X days of another row. For example, if X = 365, then the result should be:

ID  date       close
1   09/15/07   123.45
3   10/25/08   132.01
5   11/07/09   145.99
7   07/03/11   171.10

If X = 50, the result should be:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
7   07/03/11   171.10

I've taken a look at a few questions on here but haven't found the right approach. For example, Pandas check for overlapping dates in multiple rows and Fastest way to eliminate specific dates from pandas dataframe are similar but don't quite get me what I need.

I have the following ugly code in place today. It works for small X values, but when X gets larger (e.g., X = 365) it removes all dates except the original date: every row blacklists the next X days, so for large X those windows overlap and cover the entire series.

from datetime import timedelta

filter_dates = []
for index, row in df.iterrows():
    if observation_time == 'D':  # 'D' = daily observations (set elsewhere in my code)
        for i in range(1, observation_period):  # observation_period is X
            filter_dates.append(index.date() + timedelta(days=i))
df = df[~df.index.isin(filter_dates)]

Any help/pointers would be appreciated!

Clarification:

The solution to this needs to look at every row, not just the first row.
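To make the requirement concrete, here is a rough sketch of the behavior I'm after (a greedy pass that always keeps the first row, then keeps a date only if it falls more than X days after the last kept date; this assumes the datetime index my code above uses):

X = 365
kept = [df.index[0]]  # always keep the first row
for d in df.index[1:]:
    if (d - kept[-1]).days > X:  # more than X days since the last kept date
        kept.append(d)
df = df.loc[kept]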

Upvotes: 5

Views: 2722

Answers (5)

Eric D. Brown D.Sc.

Reputation: 1956

For anyone looking for the answer that worked for me, here it is (building on @Quickbeam2k1's answer):

import numpy as np

X = 50  # or whatever value you want
remove_ids = []
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if np.busday_count(last_day, row["date"]) < X:
        remove_ids.append(index)
    else:
        last_day = row["date"]
df = df.drop(remove_ids)  # drop the rows that overlap the last kept row
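Note that np.busday_count counts business days (weekdays), so X here is measured in business days rather than calendar days; for a calendar-day window, use (row["date"] - last_day).days as in Quickbeam2k1's answer below.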

Upvotes: 0

Cory Madden

Reputation: 5193

I found yet another solution to this (you can review the edit history if you want to see the old ones). This is the best solution I've come up with. It still keeps the first sequential record, but it can be tweaked to keep the record that comes chronologically first (provided at the end).

days = 365  # X, the size of the window

target = df.iloc[0]  # get the first item in the dataframe
day_diff = abs(target.date - df.date)  # differences of all the other dates from the first item
# reset the index, then sort by date and original index so we maintain the order of the dates
day_diff = day_diff.reset_index().sort_values(['date', 'index'])
day_diff.columns = ['old_index', 'date']  # rename the old index column because of a name clash
# group the dates by X-day range, then take the first item from each group
good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
df.iloc[good_ids]

Once again, I performed some tests to compare it to QuickBeam's method. The DataFrames used were a randomly sorted one with 600,000 rows and a date-ordered one with 73,000 rows:

My Method:

DataFrame             days           time
600k/random            2             1 loop, best of 3: 5.03 s per loop
ordered                2             1 loop, best of 3: 564 ms per loop


600k/random            50            1 loop, best of 3: 5.17 s per loop
ordered                50            1 loop, best of 3: 583 ms per loop


600k/random            365           1 loop, best of 3: 5.16 s per loop
ordered                365           1 loop, best of 3: 577 ms per loop

QuickBeam's Method:

DataFrame             days           time

600k/random            2             1 loop, best of 3: 52.8 s per loop
ordered                2             1 loop, best of 3: 4.89 s per loop


600k/random            50            1 loop, best of 3: 53 s per loop
ordered                50            1 loop, best of 3: 4.53 s per loop

600k/random            365           1 loop, best of 3: 53.7 s per loop
ordered                365           1 loop, best of 3: 4.49 s per loop

So yeah, maybe I'm a little competitive...

Exact functions used for testing:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values(['date', 'index'])
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]

def quickbeam_filter(df, days):
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > days:
            filter_ids.append(index)
            last_day = row["date"]
    return df.loc[filter_ids, :]
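For reference, applying either function to the question's data (with the date column parsed to datetime, as in Quickbeam2k1's answer below) and days=365 keeps rows 0, 2, 4 and 6:

print(my_filter(df, 365))
#     close       date
# 0  123.45 2007-09-15
# 2  132.01 2008-10-25
# 4  145.99 2009-11-07
# 6  171.10 2011-07-03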

In case you want to get all the dates that start in a certain range, which makes more sense to me, you could use this version:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values('date')  # sort by date only
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]

Upvotes: 2

Quickbeam2k1

Reputation: 5437

I just used an elementary approach (essentially a tweaked version of the OP's approach): no fancy numpy or pandas ops, but linear instead of quadratic complexity (compared to the distance matrix approach).
However, like Cory Madden, I assume that the data is sorted with respect to the date column. I hope it's correct:

The dataframe (I'm using the pandas index here):

import pandas as pd
df = pd.DataFrame({'date': ["2007-09-15","2008-06-01","2008-10-25",
                            "2009-05-13","2009-11-07", "2009-11-15", "2011-07-03"],
                   'close':[123.45, 130.13, 132.01, 118.34, 
                            145.99, 146.73, 171.10]})
df["date"]=pd.to_datetime(df["date"])

The following block of code can easily be wrapped in a function and computes the correct dataframe indexes for X=365:

X = 365
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] - last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]

and the result:

print(df.loc[filter_ids,:])
    close       date
0  123.45 2007-09-15
2  132.01 2008-10-25
4  145.99 2009-11-07
6  171.10 2011-07-03

Note that the indexes are shifted by one relative to the OP's ID column because the dataframe index starts from zero.


I just wanted to comment on linear versus quadratic complexity. My solution has linear time complexity: it sees every row of the data frame exactly once. Cory Madden's solution has quadratic complexity: in each iteration, every remaining row of the data frame is accessed. However, if X (the day difference) is large, we might discard a huge part of the data set and only perform very few iterations.

To this end, one might want to consider the following worst-case data set for X=2:

df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})

On my machine the following codes yield:

%%timeit
X = 2
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] -last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
1 loop, best of 3: 7.06 s per loop

and

day_diffs = abs(df.iloc[0].date - df.date).dt.days
i = 0
days = 2
idx = day_diffs.index[i]
good_ids = {idx}
while True:
    try:
        current_row = day_diffs[idx]
        day_diffs = day_diffs.iloc[1:]
        records_not_overlapping = (day_diffs - current_row) > days
        idx = records_not_overlapping[records_not_overlapping == True].index[0]
        good_ids.add(idx)
    except IndexError:
        break
1 loop, best of 3: 3min 16s per loop

Upvotes: 0

DJK

Reputation: 9264

My approach would be to first compute a distance matrix

import numpy as np

distM = np.array([[np.timedelta64(abs(x - y), 'D').astype(int) for y in df.date] for x in df.date])

In your example this looks like this:

[[   0  260  406  606  784  792 1387]
 [ 260    0  146  346  524  532 1127]
 [ 406  146    0  200  378  386  981]
 [ 606  346  200    0  178  186  781]
 [ 784  524  378  178    0    8  603]
 [ 792  532  386  186    8    0  595]
 [1387 1127  981  781  603  595    0]]

Since we're iterating downwards, we only care about the distances in the top triangle, so we mask everything else: np.triu zeroes out the bottom triangle, and the comparison then flags those zeros along with any top-triangle distance of 365 or less, all of which get set to a large number M; in this case I'll use 10,000.

distM[np.triu(distM) <= 365] = 10000

Then take the argmin across each row of the masked distance matrix to decide which rows of the dataframe to keep:

keep = np.unique(np.argmin(distM, axis=1))
df = df.iloc[keep, :]

and all together...

import numpy as np

distM = np.array([[np.timedelta64(abs(x - y), 'D').astype(int) for y in df.date] for x in df.date])

distM[np.triu(distM) <= 365] = 10000

keep = np.unique(np.argmin(distM, axis=1))

df = df.iloc[keep, :]
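As a side note, the same matrix can be built much faster without the nested Python loops by using numpy broadcasting (a sketch, assuming df.date is a datetime column):

# cast to day precision, then take all pairwise differences at once
dates = df['date'].values.astype('datetime64[D]')
distM = np.abs(dates[:, None] - dates[None, :]).astype(int)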

Upvotes: 1

zipa

Reputation: 27869

You can add a new column to filter the results:

df['filter'] = df['date'] - df['date'][0]
df['filter'] = df['filter'].apply(lambda x: x.days)

Then, to filter by 365, use this:

df[df['filter'] % 365 == 0]
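As a side note, the helper column can also be built in one step with the dt accessor instead of apply (assuming the date column is already datetime):

df['filter'] = (df['date'] - df['date'][0]).dt.days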

Upvotes: 3
