Reputation:
Hi, I'm new to Python and currently using Python 3.x. I have a very large CSV dataset that needs to be filtered. I searched online, and many people recommended loading it into a pandas DataFrame (done).
My columns can be defined as: "ID", "Name", "Time", "Token", "Text"
I need to check under "Token" for any duplicates, which can be done via
df = df[df.Token.duplicated(keep=False)]
(Please correct me if I am wrong)
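If I understand correctly, keep=False flags every occurrence of a duplicated Token, not just the later ones. A quick sketch on a toy version of my column (assuming I'm reading the docs right):

import pandas as pd

toy = pd.Series(['Hello', 'Hiiii', 'Hello', 'Hello'], name='Token')
print(toy.duplicated(keep=False))
# 0     True
# 1    False
# 2     True
# 3     True
# Name: Token, dtype: bool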
The problem is that I need to keep the original row while dropping the other duplicates. For this, I was told to compare the "Time" values: among rows sharing a Token, the row with the smallest "Time" is the original and should be kept, while the rest of the duplicates are dropped.
For example:
ID | Name | Time | Token | Text
1  | John | 333  | Hello | xxxx
2  | Mary | 233  | Hiiii | xxxx
3  | Jame | 222  | Hello | xxxx
4  | Kenn | 555  | Hello | xxxx
Desired output:
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
What I have done:
## compare and keep the smaller value
def dups(df):
    return df[df["Time"] < df["Time"]]

df = df[df.Token.duplicated()].apply(dups)
This is roughly where I am stuck! Can anyone help? It's my first time coding in Python; any help will be greatly appreciated.
Upvotes: 3
Views: 2367
Reputation: 403278
Use sort_values + drop_duplicates:
df = df.sort_values('Time') \
       .drop_duplicates('Token', keep='first') \
       .sort_index()
df
   ID  Name  Time  Token  Text
1   2  Mary   233  Hiiii  xxxx
2   3  Jame   222  Hello  xxxx
The final sort_index call restores the original row order of your dataframe. If you want a monotonically increasing index beyond this point, call reset_index.
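For reference, here is a minimal end-to-end sketch of this approach, rebuilding the question's example frame (column values taken from the table above) so it can be run as-is:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['John', 'Mary', 'Jame', 'Kenn'],
    'Time': [333, 233, 222, 555],
    'Token': ['Hello', 'Hiiii', 'Hello', 'Hello'],
    'Text': ['xxxx', 'xxxx', 'xxxx', 'xxxx'],
})

# Sort so the smallest Time comes first within each Token,
# keep that first occurrence, then restore the original row order.
out = (df.sort_values('Time')
         .drop_duplicates('Token', keep='first')
         .sort_index())
print(out)
#    ID  Name  Time  Token  Text
# 1   2  Mary   233  Hiiii  xxxx
# 2   3  Jame   222  Hello  xxxx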
Upvotes: 1