MDan

Reputation: 343

Updating pandas row value where ids match, and timestamp is greater than previous records

I have a DataFrame with the following structure:

             id          created_at  seen_before
0          1043 2021-11-27 16:56:43            0
1          1027 2021-11-22 19:01:21            0
2          1099 2021-11-22 07:37:02            0
3          1099 2021-11-22 07:36:50            0
4          1099 2021-11-22 07:36:41            0
5          1027 2021-11-22 07:36:39            0

For each row, I would like to check whether a row with the same id and an earlier timestamp exists, and if so mark it with a 1 in the seen_before column. The DataFrame would then be updated like this:

             id          created_at  seen_before
0          1043 2021-11-27 16:56:43            0
1          1027 2021-11-22 19:01:21            1
2          1099 2021-11-22 07:37:02            1
3          1099 2021-11-22 07:36:50            1
4          1099 2021-11-22 07:36:41            0
5          1027 2021-11-22 07:36:39            0

I have written a solution that iterates over each row and compares it with the rest of the DataFrame:

for _, row in df.iterrows():
    # Mark every later row that shares this row's id
    df.loc[(df['created_at'] > row['created_at']) & (df['id'] == row['id']), 'seen_before'] = 1

This works, but it does not scale well: I am dealing with a lot of data and would like to avoid iterating over each row.

Is there a more scalable approach in pandas?

Upvotes: 0

Views: 502

Answers (3)

Corralien

Reputation: 120439

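As your created_at column is sorted in descending order, the last occurrence of each id is its earliest record, so every other occurrence has been seen before. You can simply use: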

df['seen_before'] = df.duplicated('id', keep='last').astype(int)
print(df)

# Output
     id          created_at  seen_before
0  1043 2021-11-27 16:56:43            0
1  1027 2021-11-22 19:01:21            1
2  1099 2021-11-22 07:37:02            1
3  1099 2021-11-22 07:36:50            1
4  1099 2021-11-22 07:36:41            0
5  1027 2021-11-22 07:36:39            0
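
The one-liner above relies on the rows already being in descending created_at order. As a minimal sketch, assuming the frame may arrive in any order and that timestamps are unique within each id, sorting first and letting pandas align the result by index removes that dependency. The rows below are the question's data, shuffled for illustration:

```python
import pandas as pd

# The question's sample rows in a deliberately shuffled order (illustrative)
df = pd.DataFrame({
    "id": [1027, 1099, 1043, 1099, 1027, 1099],
    "created_at": pd.to_datetime([
        "2021-11-22 07:36:39",
        "2021-11-22 07:37:02",
        "2021-11-27 16:56:43",
        "2021-11-22 07:36:41",
        "2021-11-22 19:01:21",
        "2021-11-22 07:36:50",
    ]),
})

# Sort newest-first so the last occurrence of each id is its earliest record
ordered = df.sort_values("created_at", ascending=False)

# duplicated() keeps the sorted frame's index labels, so the assignment
# aligns each flag back to its original row in df
df["seen_before"] = ordered.duplicated("id", keep="last").astype(int)
print(df)
```

If two rows for the same id can share a timestamp, this variant flags one of the tied rows, whereas a strict "earlier than" comparison would flag neither.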

Upvotes: 0

user7864386

Reputation:

You can group by "id" and use transform('min') on "created_at" to broadcast each id's earliest datetime back to every row of the DataFrame. Then compare that minimum with the original "created_at" values using lt. This produces a boolean Series that is True for every row whose datetime comes after the earliest datetime for its "id". Convert it to dtype int for the final output.

# Ensure created_at is a datetime column so the comparison is chronological
df['created_at'] = pd.to_datetime(df['created_at'])
# 1 where the row is later than the earliest created_at for its id
df['seen_before'] = df.groupby('id')['created_at'].transform('min').lt(df['created_at']).astype(int)

Output:

     id          created_at  seen_before
0  1043 2021-11-27 16:56:43            0
1  1027 2021-11-22 19:01:21            1
2  1099 2021-11-22 07:37:02            1
3  1099 2021-11-22 07:36:50            1
4  1099 2021-11-22 07:36:41            0
5  1027 2021-11-22 07:36:39            0
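
To check that the vectorized expression agrees with the row-by-row loop from the question, the following sketch rebuilds the sample data and runs both. The names `looped` and `vectorized` are introduced here for illustration only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1043, 1027, 1099, 1099, 1099, 1027],
    "created_at": pd.to_datetime([
        "2021-11-27 16:56:43",
        "2021-11-22 19:01:21",
        "2021-11-22 07:37:02",
        "2021-11-22 07:36:50",
        "2021-11-22 07:36:41",
        "2021-11-22 07:36:39",
    ]),
})

# Row-by-row approach from the question, run on a copy
looped = df.copy()
looped["seen_before"] = 0
for _, row in looped.iterrows():
    later_same_id = (looped["created_at"] > row["created_at"]) & (looped["id"] == row["id"])
    looped.loc[later_same_id, "seen_before"] = 1

# Vectorized approach: compare each row with the earliest timestamp of its id
earliest = df.groupby("id")["created_at"].transform("min")
vectorized = earliest.lt(df["created_at"]).astype(int)

# Both approaches should flag exactly the same rows
print(looped["seen_before"].tolist() == vectorized.tolist())
```

The loop issues one full-frame comparison per row, while the groupby version computes each group's minimum once and does a single element-wise comparison, which is why it is expected to scale better on large frames.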

Upvotes: 1

Inputvector

Reputation: 1093

You can use apply with a lambda function that compares each row's timestamp with the earliest timestamp for the same id:

df['seen'] = df.apply(lambda row: 1 if row['created_at'] > df.loc[df['id'] == row['id'], 'created_at'].min() else 0, axis=1)

Result:

     id          created_at  seen
0  1043 2021-11-27 16:56:43     0
1  1027 2021-11-22 19:01:21     1
2  1099 2021-11-22 07:37:02     1
3  1099 2021-11-22 07:36:50     1
4  1099 2021-11-22 07:36:41     0
5  1027 2021-11-22 07:36:39     0

Upvotes: 0
