gginelli
gginelli

Reputation: 1

Selecting pandas dataframe rows depending on other rows

I have a pandas dataframe with a few thousand rows and multiple columns, one of which contains IDs (integers) and one that contains dates (python datetime objects). Different rows can have identical IDs and/or dates+time. I would like to add a boolean column to the dataframe which indicates for each row if the dataframe contains another row that has an identical ID AND has a date+time that is within 2 months of - BUT NOT identical to - the date of the row itself.

See below for some example data where the "Boolean condition" column is filled correctly.

index ID Date Boolean condition
1 20000242 2021-06-18 09:46:13.433 True
2 20000242 2021-06-18 09:46:13.433 True
3 20000242 2021-07-16 11:41:49.557 True
4 20000242 2021-08-13 13:36:30.417 True
5 20103342 2018-10-19 11:10:45.817 False
6 20103657 2019-09-13 10:18:00.173 True
7 20103657 2019-09-13 10:18:00.173 True
8 20103657 2019-09-13 12:16:53.737 True
9 20103657 2019-10-04 14:24:41.067 True
10 20103657 2020-03-03 13:50:32.953 False
11 20103657 2020-03-03 13:50:32.953 False
12 20103657 2020-03-03 13:50:32.953 False
13 20103657 2020-03-03 13:50:32.953 False
14 20103657 2020-03-03 13:50:32.953 False
15 20105037 2020-12-18 15:48:09.187 False
16 20201854 2021-03-05 14:49:11.850 True
17 20201854 2021-04-02 14:07:50.213 True
18 20201854 2021-05-28 08:35:57.363 True
19 20201854 2021-05-28 08:35:57.363 True
20 20203290 2020-02-07 11:13:08.117 False
21 20203290 2020-02-07 11:13:08.117 False
22 20203290 2020-02-07 11:13:08.117 False
23 20203290 2020-02-07 11:13:08.117 False
24 20203290 2020-02-07 11:13:08.117 False
25 20203290 2020-02-07 11:13:08.117 False
26 20401499 2018-07-13 11:10:02.960 True
27 20401499 2018-07-13 11:10:02.960 True
28 20401499 2018-07-13 13:46:52.483 True
29 20401499 2018-07-13 13:46:52.483 True
30 20402860 2019-12-13 11:18:31.253 False

I've tried to accomplish this naively with nested loops, but this is not feasible for thousands of rows.

Does anybody know a more efficient / pythonic way to do this?

EDIT: Here is the same data in semi-colon separated values:

index;ID;Date;Boolean condition

1;20000242;2021-06-18 09:46:13.433;True

2;20000242;2021-06-18 09:46:13.433;True

3;20000242;2021-07-16 11:41:49.557;True

4;20000242;2021-08-13 13:36:30.417;True

5;20103342;2018-10-19 11:10:45.817;False

6;20103657;2019-09-13 10:18:00.173;True

7;20103657;2019-09-13 10:18:00.173;True

8;20103657;2019-09-13 12:16:53.737;True

9;20103657;2019-10-04 14:24:41.067;True

10;20103657;2020-03-03 13:50:32.953;False

11;20103657;2020-03-03 13:50:32.953;False

12;20103657;2020-03-03 13:50:32.953;False

13;20103657;2020-03-03 13:50:32.953;False

14;20103657;2020-03-03 13:50:32.953;False

15;20105037;2020-12-18 15:48:09.187;False

16;20201854;2021-03-05 14:49:11.850;True

17;20201854;2021-04-02 14:07:50.213;True

18;20201854;2021-05-28 08:35:57.363;True

19;20201854;2021-05-28 08:35:57.363;True

20;20203290;2020-02-07 11:13:08.117;False

21;20203290;2020-02-07 11:13:08.117;False

22;20203290;2020-02-07 11:13:08.117;False

23;20203290;2020-02-07 11:13:08.117;False

24;20203290;2020-02-07 11:13:08.117;False

25;20203290;2020-02-07 11:13:08.117;False

26;20401499;2018-07-13 11:10:02.960;True

27;20401499;2018-07-13 11:10:02.960;True

28;20401499;2018-07-13 13:46:52.483;True

29;20401499;2018-07-13 13:46:52.483;True

30;20402860;2019-12-13 11:18:31.253;False

Upvotes: 0

Views: 169

Answers (1)

Fran Arenas
Fran Arenas

Reputation: 648

Pandas indexing is clearer and more efficient than nested loops:

df_duplicated = df[df['ID'].duplicated() == True] # duplicated IDs

another_column_in_range = []

for index,row in df_duplicated.iterrows(): # Iterate over rows
  df_id,df_date = row['ID'],row['Date']

  catch_df = df_duplicated[df_duplicated['Date'] - df_date < timedelta(days=60)] # Generate a df with the rows that match the condition

  if catch_df.empy: # If catch_df is empty, the condition didn't match any row
    another_column_in_range.append(False)
  else:
    another_column_in_range.append(True)

df_duplicated["another_column_in_range"] = another_column_in_range

final_catch = []
for index in len(df):
    if df[index]['ID'] in df_duplicated:
        if df[index]['ID']["another_column_in_range"] == True:
           final_catch.append(True)
           break
    final_catch.append(False)

df["another_column_in_range"] = final_catch

Upvotes: 3

Related Questions