Pandas - Merge nearly duplicate rows filtering the last timestamp

Question

I have a two pandas dataframe with several rows that are near duplicates of each other, except for one value, which is timestamp value. My goal is to merge these dataframes into a single dataframe, and for these nearly repeat rows, get the row with the last timestamp.

Here is an example of what I'm working with:

DF1:

       id               name created_at
0   1  Cristiano Ronaldo 2020-01-20
1   2              Messi 2020-01-20
2   3             Juarez 2020-01-20

DF2:

      id               name created_at
0   1  Cristiano Ronaldo 2020-01-20
1   2              Messi 2020-01-20
2   3             Juarez 2020-02-20

And here is what I would like:

       id               name created_at
3   1  Cristiano Ronaldo 2020-01-20
4   2              Messi 2020-01-20
5   3             Juarez 2020-02-20

For the row Juarez I get the last "created_ad"

Tha is it possible?

Shradha · Accepted Answer

You can append the second dataframe to the first one, sort the dataframe using timestamp and then drop duplicates.

df_merged = df1.append(df2, ignore_index = True)
df_merged = df_merged.sort_values('created_at')
df_columns = df_merged.columns.tolist()
df_columns.remove('created_at')
df_merged.drop_duplicates(inplace = True, keep = 'last', subset = df_columns)

Pandas - Merge nearly duplicate rows filtering the last timestamp

Answers (1)

Related Questions