Reputation: 1
I have a pandas dataframe with a few thousand rows and multiple columns, one of which contains IDs (integers) and one that contains dates (python datetime objects). Different rows can have identical IDs and/or dates+time. I would like to add a boolean column to the dataframe which indicates for each row if the dataframe contains another row that has an identical ID AND has a date+time that is within 2 months of - BUT NOT identical to - the date of the row itself.
See below for some example data where the "Boolean condition" column is filled correctly.
index | ID | Date | Boolean condition |
---|---|---|---|
1 | 20000242 | 2021-06-18 09:46:13.433 | True |
2 | 20000242 | 2021-06-18 09:46:13.433 | True |
3 | 20000242 | 2021-07-16 11:41:49.557 | True |
4 | 20000242 | 2021-08-13 13:36:30.417 | True |
5 | 20103342 | 2018-10-19 11:10:45.817 | False |
6 | 20103657 | 2019-09-13 10:18:00.173 | True |
7 | 20103657 | 2019-09-13 10:18:00.173 | True |
8 | 20103657 | 2019-09-13 12:16:53.737 | True |
9 | 20103657 | 2019-10-04 14:24:41.067 | True |
10 | 20103657 | 2020-03-03 13:50:32.953 | False |
11 | 20103657 | 2020-03-03 13:50:32.953 | False |
12 | 20103657 | 2020-03-03 13:50:32.953 | False |
13 | 20103657 | 2020-03-03 13:50:32.953 | False |
14 | 20103657 | 2020-03-03 13:50:32.953 | False |
15 | 20105037 | 2020-12-18 15:48:09.187 | False |
16 | 20201854 | 2021-03-05 14:49:11.850 | True |
17 | 20201854 | 2021-04-02 14:07:50.213 | True |
18 | 20201854 | 2021-05-28 08:35:57.363 | True |
19 | 20201854 | 2021-05-28 08:35:57.363 | True |
20 | 20203290 | 2020-02-07 11:13:08.117 | False |
21 | 20203290 | 2020-02-07 11:13:08.117 | False |
22 | 20203290 | 2020-02-07 11:13:08.117 | False |
23 | 20203290 | 2020-02-07 11:13:08.117 | False |
24 | 20203290 | 2020-02-07 11:13:08.117 | False |
25 | 20203290 | 2020-02-07 11:13:08.117 | False |
26 | 20401499 | 2018-07-13 11:10:02.960 | True |
27 | 20401499 | 2018-07-13 11:10:02.960 | True |
28 | 20401499 | 2018-07-13 13:46:52.483 | True |
29 | 20401499 | 2018-07-13 13:46:52.483 | True |
30 | 20402860 | 2019-12-13 11:18:31.253 | False |
I've tried to accomplish this naively with nested loops, but this is not feasible for thousands of rows.
Does anybody know a more efficient / pythonic way to do this?
EDIT: Here is the same data in semi-colon separated values:
index;ID;Date;Boolean condition
1;20000242;2021-06-18 09:46:13.433;True
2;20000242;2021-06-18 09:46:13.433;True
3;20000242;2021-07-16 11:41:49.557;True
4;20000242;2021-08-13 13:36:30.417;True
5;20103342;2018-10-19 11:10:45.817;False
6;20103657;2019-09-13 10:18:00.173;True
7;20103657;2019-09-13 10:18:00.173;True
8;20103657;2019-09-13 12:16:53.737;True
9;20103657;2019-10-04 14:24:41.067;True
10;20103657;2020-03-03 13:50:32.953;False
11;20103657;2020-03-03 13:50:32.953;False
12;20103657;2020-03-03 13:50:32.953;False
13;20103657;2020-03-03 13:50:32.953;False
14;20103657;2020-03-03 13:50:32.953;False
15;20105037;2020-12-18 15:48:09.187;False
16;20201854;2021-03-05 14:49:11.850;True
17;20201854;2021-04-02 14:07:50.213;True
18;20201854;2021-05-28 08:35:57.363;True
19;20201854;2021-05-28 08:35:57.363;True
20;20203290;2020-02-07 11:13:08.117;False
21;20203290;2020-02-07 11:13:08.117;False
22;20203290;2020-02-07 11:13:08.117;False
23;20203290;2020-02-07 11:13:08.117;False
24;20203290;2020-02-07 11:13:08.117;False
25;20203290;2020-02-07 11:13:08.117;False
26;20401499;2018-07-13 11:10:02.960;True
27;20401499;2018-07-13 11:10:02.960;True
28;20401499;2018-07-13 13:46:52.483;True
29;20401499;2018-07-13 13:46:52.483;True
30;20402860;2019-12-13 11:18:31.253;False
Upvotes: 0
Views: 169
Reputation: 648
Pandas indexing is clearer and more efficient than nested loops:
df_duplicated = df[df['ID'].duplicated() == True] # duplicated IDs
another_column_in_range = []
for index,row in df_duplicated.iterrows(): # Iterate over rows
df_id,df_date = row['ID'],row['Date']
catch_df = df_duplicated[df_duplicated['Date'] - df_date < timedelta(days=60)] # Generate a df with the rows that match the condition
if catch_df.empy: # If catch_df is empty, the condition didn't match any row
another_column_in_range.append(False)
else:
another_column_in_range.append(True)
df_duplicated["another_column_in_range"] = another_column_in_range
final_catch = []
for index in len(df):
if df[index]['ID'] in df_duplicated:
if df[index]['ID']["another_column_in_range"] == True:
final_catch.append(True)
break
final_catch.append(False)
df["another_column_in_range"] = final_catch
Upvotes: 3