Reputation: 1
My dataset is unnecessarily long due to poor back-end management, so I have a lot of duplicates that I want to delete.
My data records bats landing on sensors; I have sites (str), individuals (int), dates, and times. I only want to keep the arrival log of each individual, not all of the duplicate rows logged until the bat departs from the sensor. I also want to keep any arrival log that is at least 15 minutes after the last departure of the same individual, at the same site, on the same day.
So far I have created new lists for each site.
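For reference, here is a small mock-up of what the raw logs look like (the site names, IDs, and timestamps are illustrative):

import pandas as pd

# Mock-up: the same arrival is logged repeatedly until the bat departs,
# then a new arrival for the same individual appears later the same day.
df = pd.DataFrame({
    'Site': ['vc', 'vc', 'vc', 'vc'],
    'ID':   [1111, 1111, 1111, 1111],
    'Date': ['2024-05-10'] * 4,
    'Time': ['1:58:58', '1:58:58', '1:58:58', '2:10:03'],
})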
Upvotes: -3
Views: 31
Reputation: 22420
Could you just use a loop after grouping? Below I build a combined Datetime column, sort, drop the exact duplicate logs, and then, within each (Site, ID, Date) group, keep only the rows that come at least 15 minutes after the last kept arrival.
In [1]: import pandas as pd
...: from io import StringIO
...: from datetime import timedelta
In [2]: csv_data = """Site,ID,Date,Time
...: vc,1111,2024-05-10,1:58:58
...: vc,1111,2024-05-10,1:58:58
...: vc,1111,2024-05-10,1:58:58
...: vc,1111,2024-05-10,1:58:58
...: vc,1111,2024-05-10,2:10:03
...: vc,1111,2024-05-10,2:10:03
...: nm,1111,2024-05-11,12:33:14
...: nm,1111,2024-05-11,12:34:00
...: nm,2222,2024-05-11,6:17:08
...: nm,2222,2024-05-11,9:15:19
...: """
In [3]: df = pd.read_csv(StringIO(csv_data))
In [4]: df
Out[4]:
  Site    ID        Date      Time
0   vc  1111  2024-05-10   1:58:58
1   vc  1111  2024-05-10   1:58:58
2   vc  1111  2024-05-10   1:58:58
3   vc  1111  2024-05-10   1:58:58
4   vc  1111  2024-05-10   2:10:03
5   vc  1111  2024-05-10   2:10:03
6   nm  1111  2024-05-11  12:33:14
7   nm  1111  2024-05-11  12:34:00
8   nm  2222  2024-05-11   6:17:08
9   nm  2222  2024-05-11   9:15:19
In [5]: df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
In [6]: df
Out[6]:
  Site    ID        Date      Time             Datetime
0   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
1   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
2   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
3   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
4   vc  1111  2024-05-10   2:10:03  2024-05-10 02:10:03
5   vc  1111  2024-05-10   2:10:03  2024-05-10 02:10:03
6   nm  1111  2024-05-11  12:33:14  2024-05-11 12:33:14
7   nm  1111  2024-05-11  12:34:00  2024-05-11 12:34:00
8   nm  2222  2024-05-11   6:17:08  2024-05-11 06:17:08
9   nm  2222  2024-05-11   9:15:19  2024-05-11 09:15:19
In [7]: sorted_df = df.sort_values(by=['Site', 'ID', 'Datetime']).reset_index(drop=True)
In [8]: sorted_df
Out[8]:
  Site    ID        Date      Time             Datetime
0   nm  1111  2024-05-11  12:33:14  2024-05-11 12:33:14
1   nm  1111  2024-05-11  12:34:00  2024-05-11 12:34:00
2   nm  2222  2024-05-11   6:17:08  2024-05-11 06:17:08
3   nm  2222  2024-05-11   9:15:19  2024-05-11 09:15:19
4   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
5   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
6   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
7   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
8   vc  1111  2024-05-10   2:10:03  2024-05-10 02:10:03
9   vc  1111  2024-05-10   2:10:03  2024-05-10 02:10:03
In [9]: deduplicated_df = sorted_df.drop_duplicates(subset=['Site', 'ID', 'Datetime']).reset_index(drop=True)
In [10]: deduplicated_df
Out[10]:
  Site    ID        Date      Time             Datetime
0   nm  1111  2024-05-11  12:33:14  2024-05-11 12:33:14
1   nm  1111  2024-05-11  12:34:00  2024-05-11 12:34:00
2   nm  2222  2024-05-11   6:17:08  2024-05-11 06:17:08
3   nm  2222  2024-05-11   9:15:19  2024-05-11 09:15:19
4   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
5   vc  1111  2024-05-10   2:10:03  2024-05-10 02:10:03
In [11]: filtered_rows = []
...: # Within each (Site, ID, Date) group, keep a row only if it is the first
...: # log, or if it comes at least 15 minutes after the last row we kept
...: # ("at least 15 minutes" means >=, not >). The data has no explicit
...: # departure timestamp, so the last kept arrival is the reference point.
...: for _, group in deduplicated_df.groupby(['Site', 'ID', 'Date']):
...:     last_time = None
...:     for _, row in group.iterrows():
...:         if last_time is None or (row['Datetime'] - last_time) >= timedelta(minutes=15):
...:             filtered_rows.append(row)
...:             last_time = row['Datetime']
...: filtered_df = pd.DataFrame(filtered_rows).reset_index(drop=True)
In [12]: filtered_df
Out[12]:
  Site    ID        Date      Time             Datetime
0   nm  1111  2024-05-11  12:33:14  2024-05-11 12:33:14
1   nm  2222  2024-05-11   6:17:08  2024-05-11 06:17:08
2   nm  2222  2024-05-11   9:15:19  2024-05-11 09:15:19
3   vc  1111  2024-05-10   1:58:58  2024-05-10 01:58:58
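If this does what you want, the same steps could be collapsed into one helper for the full dataset. A minimal sketch, assuming the same column names as above; the function name collapse_landings and the gap_minutes parameter are mine, not part of pandas:

import pandas as pd

def collapse_landings(df, gap_minutes=15):
    """Keep the first log of each duplicate burst, plus any arrival at least
    `gap_minutes` minutes after the last kept arrival (same Site/ID/Date)."""
    out = df.copy()
    out['Datetime'] = pd.to_datetime(out['Date'] + ' ' + out['Time'])
    out = (out.sort_values(['Site', 'ID', 'Datetime'])
              .drop_duplicates(subset=['Site', 'ID', 'Datetime']))
    gap = pd.Timedelta(minutes=gap_minutes)
    kept = []
    for _, group in out.groupby(['Site', 'ID', 'Date']):
        last_kept = None  # Datetime of the most recent row we decided to keep
        for _, row in group.iterrows():
            if last_kept is None or (row['Datetime'] - last_kept) >= gap:
                kept.append(row)
                last_kept = row['Datetime']
    return pd.DataFrame(kept).reset_index(drop=True)

Calling filtered_df = collapse_landings(df) on the raw frame reproduces Out[12] above; adjust gap_minutes if the 15-minute window ever changes.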
Upvotes: 0