Reputation: 343
I have a DataFrame with the following structure:
id created_at seen_before
0 1043 2021-11-27 16:56:43 0
1 1027 2021-11-22 19:01:21 0
2 1099 2021-11-22 07:37:02 0
3 1099 2021-11-22 07:36:50 0
4 1099 2021-11-22 07:36:41 0
5 1027 2021-11-22 07:36:39 0
For each row, I would like to check whether a row with the same id and an earlier timestamp exists, and mark that with a 1 in the seen_before column. The DataFrame would then look like this:
id created_at seen_before
0 1043 2021-11-27 16:56:43 0
1 1027 2021-11-22 19:01:21 1
2 1099 2021-11-22 07:37:02 1
3 1099 2021-11-22 07:36:50 1
4 1099 2021-11-22 07:36:41 0
5 1027 2021-11-22 07:36:39 0
I have written a solution that iterates over each row and compares it with the rest of the DataFrame:
for _, row in df.iterrows():
    df.loc[(df['created_at'] > row['created_at']) & (df['id'] == row['id']), 'seen_before'] = 1
This works, but it does not scale well: I am dealing with a lot of data and would like to avoid iterating over each row.
Is there a more scalable approach in pandas?
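For anyone who wants to reproduce the setup, here is a minimal, self-contained sketch of the row-by-row baseline described above. The values are copied from the sample table; the variable name later_same_id is only there for readability and is not part of the original code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1043, 1027, 1099, 1099, 1099, 1027],
    "created_at": pd.to_datetime([
        "2021-11-27 16:56:43",
        "2021-11-22 19:01:21",
        "2021-11-22 07:37:02",
        "2021-11-22 07:36:50",
        "2021-11-22 07:36:41",
        "2021-11-22 07:36:39",
    ]),
    "seen_before": 0,
})

# Row-by-row baseline: for each row, flag every row that shares its id
# and has a strictly later timestamp.
for _, row in df.iterrows():
    later_same_id = (df["created_at"] > row["created_at"]) & (df["id"] == row["id"])
    df.loc[later_same_id, "seen_before"] = 1

print(df)
```

Each iteration scans the whole frame, so the work grows roughly with the square of the number of rows, which is why this becomes slow on large data.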
Upvotes: 0
Views: 502
Reputation: 120439
Since your created_at column is sorted in descending order, every occurrence of an id except the last one has an earlier match, so you can simply use:
df['seen_before'] = df.duplicated('id', keep='last').astype(int)
print(df)
# Output
id created_at seen_before
0 1043 2021-11-27 16:56:43 0
1 1027 2021-11-22 19:01:21 1
2 1099 2021-11-22 07:37:02 1
3 1099 2021-11-22 07:36:50 1
4 1099 2021-11-22 07:36:41 0
5 1027 2021-11-22 07:36:39 0
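If the sort order cannot be guaranteed, a possible variation is to sort by timestamp first and let pandas align the result back on the index. This is only a sketch: the shuffled sample below is made up from the question's rows to show that ordering no longer matters, and it assumes the index is unique. Note also that two rows with the same id and an identical timestamp would have the second one flagged, which differs slightly from a strict "earlier timestamp" rule.

```python
import pandas as pd

# Rows from the question, deliberately shuffled so the frame is no
# longer sorted by created_at.
df = pd.DataFrame({
    "id": [1099, 1027, 1043, 1099, 1027, 1099],
    "created_at": pd.to_datetime([
        "2021-11-22 07:36:50",
        "2021-11-22 07:36:39",
        "2021-11-27 16:56:43",
        "2021-11-22 07:36:41",
        "2021-11-22 19:01:21",
        "2021-11-22 07:37:02",
    ]),
})

# Sort ascending by timestamp and mark every occurrence of an id after
# its first one. Assigning the Series back aligns on the original index,
# so the row order of df is left untouched.
seen = df.sort_values("created_at").duplicated("id", keep="first")
df["seen_before"] = seen.astype(int)

print(df)
```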
Upvotes: 0
Reputation:
You can groupby "id", find the min of "created_at" for each group, and transform it so that the result is broadcast back to every row of the DataFrame. Then compare it with the original "created_at" datetimes using lt to check whether each datetime comes after the minimum for its "id". This creates a boolean Series that is True for any datetime later than the earliest one for that "id"; convert it to dtype int for the final output.
df['created_at'] = pd.to_datetime(df['created_at'])
df['seen_before'] = df.groupby('id')['created_at'].transform('min').lt(df['created_at']).astype(int)
Output:
id created_at seen_before
0 1043 2021-11-27 16:56:43 0
1 1027 2021-11-22 19:01:21 1
2 1099 2021-11-22 07:37:02 1
3 1099 2021-11-22 07:36:50 1
4 1099 2021-11-22 07:36:41 0
5 1027 2021-11-22 07:36:39 0
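As a usage sketch, the same idea can be wrapped in a small helper and run on the question's sample data. The function name flag_seen_before and the variable first_seen are hypothetical, chosen only for this illustration; the comparison is written with gt from the row's side, which is equivalent to the lt call above.

```python
import pandas as pd

def flag_seen_before(frame: pd.DataFrame) -> pd.Series:
    """Return 1 where the row's id also occurs with a strictly earlier timestamp."""
    # Earliest timestamp per id, broadcast back to every row of the frame.
    first_seen = frame.groupby("id")["created_at"].transform("min")
    # A row has been "seen before" if it is later than that earliest timestamp.
    return frame["created_at"].gt(first_seen).astype(int)

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1043, 1027, 1099, 1099, 1099, 1027],
    "created_at": pd.to_datetime([
        "2021-11-27 16:56:43",
        "2021-11-22 19:01:21",
        "2021-11-22 07:37:02",
        "2021-11-22 07:36:50",
        "2021-11-22 07:36:41",
        "2021-11-22 07:36:39",
    ]),
})

df["seen_before"] = flag_seen_before(df)
print(df)
```

Because the comparison is strict, rows that tie for the earliest timestamp of an id all stay at 0, and the result does not depend on how the frame is sorted.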
Upvotes: 1
Reputation: 1093
You can use apply with a lambda function (note that this still evaluates the DataFrame row by row):
df['seen'] = df.apply(lambda row: 1 if row['created_at'] > min(df['created_at'].loc[df['id'] == row['id']]) else 0, axis = 1)
Result:
id created_at seen
0 1043 2021-11-27 16:56:43 0
1 1027 2021-11-22 19:01:21 1
2 1099 2021-11-22 07:37:02 1
3 1099 2021-11-22 07:36:50 1
4 1099 2021-11-22 07:36:41 0
5 1027 2021-11-22 07:36:39 0
Upvotes: 0