Reputation: 1956
I have a pandas dataframe that looks like the following:
ID date close
1 09/15/07 123.45
2 06/01/08 130.13
3 10/25/08 132.01
4 05/13/09 118.34
5 11/07/09 145.99
6 11/15/09 146.73
7 07/03/11 171.10
I want to remove any rows that overlap.
Overlapping rows are defined as any row within X days of another row. For example, if X = 365, then the result should be:
ID date close
1 09/15/07 123.45
3 10/25/08 132.01
5 11/07/09 145.99
7 07/03/11 171.10
If X = 50, the result should be:
ID date close
1 09/15/07 123.45
2 06/01/08 130.13
3 10/25/08 132.01
4 05/13/09 118.34
5 11/07/09 145.99
7 07/03/11 171.10
I've taken a look at a few questions on here but haven't found the right approach. For example, Pandas check for overlapping dates in multiple rows and Fastest way to eliminate specific dates from pandas dataframe are similar but don't quite get me what I need.
I have the following ugly code in place today that works for small X values, but when X gets larger (e.g., when X = 365) it removes all dates except the original date.
filter_dates = []
for index, row in df.iterrows():
    if observation_time == 'D':
        for i in range(1, observation_period):
            filter_dates.append(index.date() + timedelta(days=i))
df = df[~df.index.isin(filter_dates)]
Any help/pointers would be appreciated!
Clarification:
The solution to this needs to look at every row, not just the first row.
Upvotes: 5
Views: 2722
Reputation: 1956
For anyone looking for the answer that worked for me, here it is (building on @Quickbeam2k1's answer):
import numpy as np
X = 50  # or whatever value you want
remove_ids = []
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    # drop any row that falls within X business days of the last kept row
    if np.busday_count(last_day, df.loc[index, "date"]) < X:
        remove_ids.append(index)
    else:
        last_day = df.loc[index, "date"]
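To actually remove the collected rows afterwards, something along these lines should work (this last step is my addition, assuming the default integer index used above):
df = df.drop(remove_ids)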
Upvotes: 0
Reputation: 5193
I found yet another solution to this (you can review the edit history if you want to see the old ones). This is the best solution I've come up with. It still keeps the first sequential record, but it can be tweaked to keep the record that comes chronologically first (provided at the end).
days = 365  # X, the size of the window in days
target = df.iloc[0]  # Get the first item in the dataframe
day_diff = abs(target.date - df.date)  # Get the differences of all the other dates from the first item
day_diff = day_diff.reset_index().sort_values(['date', 'index'])  # Reset the index and then sort by date and original index so we can maintain the order of the dates
day_diff.columns = ['old_index', 'date']  # Rename the old index column because of the name clash
good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values  # Group the dates by range and then get the first item from each group
df.iloc[good_ids]
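(For the sample data in the question and days = 365, good_ids comes out as [0, 2, 4, 6], i.e. IDs 1, 3, 5 and 7, which matches the expected output above.)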
Once again, I performed some tests to compare it to QuickBeam's method. The DataFrames used were a randomly sorted one with 600,000 rows and one ordered by date with 73,000 rows:
My Method:
DataFrame days time
600k/random 2 1 loop, best of 3: 5.03 s per loop
ordered 2 1 loop, best of 3: 564 ms per loop
600k/random 50 1 loop, best of 3: 5.17 s per loop
ordered 50 1 loop, best of 3: 583 ms per loop
600k/random 365 1 loop, best of 3: 5.16 s per loop
ordered 365 1 loop, best of 3: 577 ms per loop
QuickBeam's Method:
DataFrame days time
600k/random 2 1 loop, best of 3: 52.8 s per loop
ordered 2 1 loop, best of 3: 4.89 s per loop
600k/random 50 1 loop, best of 3: 53 s per loop
ordered 50 1 loop, best of 3: 4.53 s per loop
600k/random 365 1 loop, best of 3: 53.7 s per loop
ordered 365 1 loop, best of 3: 4.49 s per loop
So yeah, maybe I'm a little competitive...
Exact functions used for testing:
def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values(['date', 'index'])
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]
def quickbeam_filter(df, days):
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > days:
            filter_ids.append(index)
            last_day = row["date"]
    return df.loc[filter_ids, :]
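As a quick sanity check (this snippet is mine and was not part of the timing runs), both functions can be run against the sample data from the question once the date column has been parsed:
import pandas as pd

sample = pd.DataFrame({'date': pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                                               "2009-05-13", "2009-11-07", "2009-11-15",
                                               "2011-07-03"]),
                       'close': [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10]})

print(my_filter(sample, 365))         # keeps rows 0, 2, 4 and 6 (IDs 1, 3, 5, 7)
print(quickbeam_filter(sample, 365))  # same four rows for this input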
In case you want to get all the dates that start in a certain range, which makes more sense to me, you could use this version:
def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)  # Series of offsets from the first row
    day_diff = day_diff.sort_values()  # sort by the offset itself
    # take the first (smallest-offset) row of each bucket, keeping the original index
    good_ids = day_diff.groupby(day_diff.dt.days // days).head(1).index.values
    return df.iloc[good_ids]
Upvotes: 2
Reputation: 5437
I just used an elementary approach (essentially a tweaked version of the OP's approach), with no fancy numpy or pandas ops, but it has linear instead of quadratic complexity (when compared to the distance matrix approach).
However, like Cory Madden, I assume that the data is sorted by the date column.
I hope it's correct.
The dataframe (I'm using the pandas index here):
import pandas as pd
df = pd.DataFrame({'date': ["2007-09-15", "2008-06-01", "2008-10-25",
                            "2009-05-13", "2009-11-07", "2009-11-15",
                            "2011-07-03"],
                   'close': [123.45, 130.13, 132.01, 118.34,
                             145.99, 146.73, 171.10]})
df["date"] = pd.to_datetime(df["date"])
The following block of code can easily be wrapped in a function and computes the correct dataframe indexes for X = 365:
X = 365
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] - last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
and the result:
print(df.loc[filter_ids,:])
close date
0 123.45 2007-09-15
2 132.01 2008-10-25
4 145.99 2009-11-07
6 171.10 2011-07-03
Note that the indexes are shifted by one compared to the IDs in the question because the index starts from zero.
I just wanted to comment on linear versus quadratic complexity. My solution has linear time complexity: it sees every row of the data frame exactly once. Cory Madden's solution has quadratic complexity: in each iteration, every row of the data frame is accessed. However, if X (the day difference) is large, we might discard a huge part of the data set and only perform very few iterations.
To this end, one might want to consider the following worst-case scenario for X = 2 and the data set:
df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})
On my machine the following codes yield:
%%timeit
X = 2
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] - last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
1 loop, best of 3: 7.06 s per loop
and
day_diffs = abs(df.iloc[0].date - df.date).dt.days
i = 0
days = 2
idx = day_diffs.index[i]
good_ids = {idx}
while True:
    try:
        current_row = day_diffs[idx]
        day_diffs = day_diffs.iloc[1:]
        records_not_overlapping = (day_diffs - current_row) > days
        idx = records_not_overlapping[records_not_overlapping == True].index[0]
        good_ids.add(idx)
    except IndexError:
        break
1 loop, best of 3: 3min 16s per loop
Upvotes: 0
Reputation: 9264
My approach would be to first compute a distance matrix:
import numpy as np
distM = np.array([[np.timedelta64(abs(x-y),'D').astype(int) for y in df.date] for x in df.date])
In your example this would look like this:
[[ 0 260 406 606 784 792 1387]
[ 260 0 146 346 524 532 1127]
[ 406 146 0 200 378 386 981]
[ 606 346 200 0 178 186 781]
[ 784 524 378 178 0 8 603]
[ 792 532 386 186 8 0 595]
[1387 1127 981 781 603 595 0]]
Since we're iterating downwards, we only care about the distances in the upper triangle, so we mask out the lower triangle and set any distance of 365 days or less to a large number M; in this case I'll use 10,000.
distM[np.triu(distM) <= 365] = 10000
Then take the argmin across the new distance matrix to decide which rows of the dataframe to keep.
remove = np.unique(np.argmin(distM,axis=1))
df = df.iloc[remove,:]
and all together...
distM = np.array([[np.timedelta64(abs(x-y),'D').astype(int) for y in df.date] for x in df.date])
distM[np.triu(distM)<= 365] = 10000
remove = np.unique(np.argmin(distM,axis=1))
df = df.iloc[remove,:]
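For the sample data above, remove works out to array([0, 2, 4, 6]) (despite its name, these are the positions of the rows that are kept), and df.iloc[remove,:] returns IDs 1, 3, 5 and 7, matching the X = 365 result the question expects.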
Upvotes: 1
Reputation: 27869
You can add a new column to filter the results:
df['filter'] = df['date'] - df['date'][0]
df['filter'] = df['filter'].apply(lambda x: x.days)
Then to filter by 365 use this:
df[df['filter']%365==0]
Upvotes: 3