Reputation:
Hi, I'm new to Python and currently using Python 3.x. I have a very large CSV dataset that needs to be filtered. I searched online, and many people recommended loading it into a pandas DataFrame (done).
My columns can be defined as: "ID", "Name", "Time", "Token", "Text"
I need to check under "Token" for any duplicates, which can be done via
df = df[df.Token.duplicated(keep=False)]
(Please correct me if I am wrong)
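If I understand correctly, keep=False flags every occurrence of a duplicated Token, not just the later ones. A quick sketch on a toy version of my column (assuming I'm reading the docs right):

import pandas as pd

toy = pd.Series(['Hello', 'Hiiii', 'Hello', 'Hello'], name='Token')
print(toy.duplicated(keep=False))
# 0     True
# 1    False
# 2     True
# 3     True
# Name: Token, dtype: bool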
The problem is that I need to keep the original row while dropping the other duplicates. For this, I was told to compare the "Time" values: among rows sharing a Token, the row with the smallest "Time" is the original and should be kept, while the rest of the duplicates are dropped.
For example:
ID | Name | Time | Token | Text
1  | John | 333  | Hello | xxxx
2  | Mary | 233  | Hiiii | xxxx
3  | Jame | 222  | Hello | xxxx
4  | Kenn | 555  | Hello | xxxx
Desired output:
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
What I have done:
## compare and keep the smaller value
def dups(df):
    return df[df["Time"] < df["Time"]]

df = df[df.Token.duplicated()].apply(dups)
This is roughly where I am stuck! Can anyone help? It's my first time coding in Python; any help will be greatly appreciated.
Upvotes: 3
Views: 2367
Reputation: 403278
Use sort_values + drop_duplicates:
df = df.sort_values('Time') \
       .drop_duplicates('Token', keep='first') \
       .sort_index()
df
   ID  Name  Time  Token  Text
1   2  Mary   233  Hiiii  xxxx
2   3  Jame   222  Hello  xxxx
The final sort_index call restores the original row order of your dataframe. If you want a monotonically increasing index beyond this point, call reset_index.
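For reference, here is a minimal end-to-end sketch of this approach, rebuilding the question's example frame (column values taken from the table above) so it can be run as-is:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['John', 'Mary', 'Jame', 'Kenn'],
    'Time': [333, 233, 222, 555],
    'Token': ['Hello', 'Hiiii', 'Hello', 'Hello'],
    'Text': ['xxxx', 'xxxx', 'xxxx', 'xxxx'],
})

# Sort so the smallest Time comes first within each Token,
# keep that first occurrence, then restore the original row order.
out = (df.sort_values('Time')
         .drop_duplicates('Token', keep='first')
         .sort_index())
print(out)
#    ID  Name  Time  Token  Text
# 1   2  Mary   233  Hiiii  xxxx
# 2   3  Jame   222  Hello  xxxx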
Upvotes: 1