Shannon Underwood
Shannon Underwood

Reputation: 1

Remove duplicates in a list based on set variables and separated by 15 minutes

My dataset is unnecessarily long due to poor back end management so I have a lot of duplicates that I want to delete.

My data includes bats landing on sensors; I have sites (str), individuals (int), dates, and times. I only want to keep the arrival log of each individual, and not all of the duplicates until the bat departs from the sensor. I also want to keep all arrival logs that are at least 15 minutes separated from the last departure of that same individual, at the same site and on the same day.

example of dataset

I have created new lists for each site

Upvotes: -3

Views: 31

Answers (1)

Sash Sinha
Sash Sinha

Reputation: 22420

Could you just use a loop after grouping?

In [1]: import pandas as pd
   ...: from io import StringIO
   ...: from datetime import timedelta

In [2]: csv_data = """Site,ID,Date,Time
   ...: vc,1111,2024-05-10,1:58:58
   ...: vc,1111,2024-05-10,1:58:58
   ...: vc,1111,2024-05-10,1:58:58
   ...: vc,1111,2024-05-10,1:58:58
   ...: vc,1111,2024-05-10,2:10:03
   ...: vc,1111,2024-05-10,2:10:03
   ...: nm,1111,2024-05-11,12:33:14
   ...: nm,1111,2024-05-11,12:34:00
   ...: nm,2222,2024-05-11,6:17:08
   ...: nm,2222,2024-05-11,9:15:19
   ...: """

In [3]: df = pd.read_csv(StringIO(csv_data))

In [4]: df
Out[4]: 
  Site    ID        Date      Time
0   vc  1111  2024-05-10   1:58:58
1   vc  1111  2024-05-10   1:58:58
2   vc  1111  2024-05-10   1:58:58
3   vc  1111  2024-05-10   1:58:58
4   vc  1111  2024-05-10   2:10:03
5   vc  1111  2024-05-10   2:10:03
6   nm  1111  2024-05-11  12:33:14
7   nm  1111  2024-05-11  12:34:00
8   nm  2222  2024-05-11   6:17:08
9   nm  2222  2024-05-11   9:15:19

In [5]: df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

In [6]: df
Out[6]: 
  Site    ID        Date      Time            Datetime
0   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
1   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
2   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
3   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
4   vc  1111  2024-05-10   2:10:03 2024-05-10 02:10:03
5   vc  1111  2024-05-10   2:10:03 2024-05-10 02:10:03
6   nm  1111  2024-05-11  12:33:14 2024-05-11 12:33:14
7   nm  1111  2024-05-11  12:34:00 2024-05-11 12:34:00
8   nm  2222  2024-05-11   6:17:08 2024-05-11 06:17:08
9   nm  2222  2024-05-11   9:15:19 2024-05-11 09:15:19

In [7]: sorted_df = df.sort_values(by=['Site', 'ID', 'Datetime']).reset_index(drop=True)

In [8]: sorted_df
Out[8]: 
  Site    ID        Date      Time            Datetime
0   nm  1111  2024-05-11  12:33:14 2024-05-11 12:33:14
1   nm  1111  2024-05-11  12:34:00 2024-05-11 12:34:00
2   nm  2222  2024-05-11   6:17:08 2024-05-11 06:17:08
3   nm  2222  2024-05-11   9:15:19 2024-05-11 09:15:19
4   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
5   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
6   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
7   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
8   vc  1111  2024-05-10   2:10:03 2024-05-10 02:10:03
9   vc  1111  2024-05-10   2:10:03 2024-05-10 02:10:03

In [9]: deduplicated_df = sorted_df.drop_duplicates(subset=['Site', 'ID', 'Datetime']).reset_index(
   ...: drop=True)

In [10]: deduplicated_df
Out[10]: 
  Site    ID        Date      Time            Datetime
0   nm  1111  2024-05-11  12:33:14 2024-05-11 12:33:14
1   nm  1111  2024-05-11  12:34:00 2024-05-11 12:34:00
2   nm  2222  2024-05-11   6:17:08 2024-05-11 06:17:08
3   nm  2222  2024-05-11   9:15:19 2024-05-11 09:15:19
4   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58
5   vc  1111  2024-05-10   2:10:03 2024-05-10 02:10:03

In [11]: filtered_rows = []
    ...: for _, group in deduplicated_df.groupby(['Site', 'ID', 'Date']):
    ...:     last_time = None
    ...:     for _, row in group.iterrows():
    ...:         if last_time is None or (row['Datetime'] - last_time) > timedelta(minutes=15):
    ...:             filtered_rows.append(row)
    ...:         last_time = row['Datetime']
    ...: filtered_df = pd.DataFrame(filtered_rows).reset_index(drop=True)

In [12]: filtered_df
Out[12]: 
  Site    ID        Date      Time            Datetime
0   nm  1111  2024-05-11  12:33:14 2024-05-11 12:33:14
1   nm  2222  2024-05-11   6:17:08 2024-05-11 06:17:08
2   nm  2222  2024-05-11   9:15:19 2024-05-11 09:15:19
3   vc  1111  2024-05-10   1:58:58 2024-05-10 01:58:58

Upvotes: 0

Related Questions