Reputation: 1161
I have a panda data frame (after concatenating two dataframes) which has some duplicate rows, except two columns, where some row identifier are written. E.g.
A B C D E F
Peter 1 c d e f
Paula 2 g h i j
Frank 3 c d e f
Robert 4 k l m n
Sarah 5 g h i j
use for testing:
df= pd.DataFrame({"A":["Peter", "Paula", "Frank", "Robert", "Sara"],
"B":[1,2,3,4,5],
"C":["c","g","c","k","g"],
"D":["d","h","d","l","h"],
"E":["e","i","e","m","i"],
"F":["f","j","f","n","j"]})
I want to keep only the first appearance of the duplicates in the colums C to F and keep the name and number (columns "A" and "B") of that rows. Thus, we would gain
A B C D E F
Peter 1 c d e f
Paula 2 g h i j
Robert 4 k l m n
I tried something out with df.drop_duplicates, but that does not work with excluding rows "A" and "B". Also, when separating into two dataframes with A and B, and C to D, drop_duplicate, and merge afterwards via the index does not work, since drop_duplicates resets the index. So, how to achieve that? Thank you.
Upvotes: 0
Views: 361
Reputation: 566
df2 = df.drop_duplicates(subset=["C", "D", "E", "F"])
Output:
A B C D E F
0 Peter 1 c d e f
1 Paula 2 g h i j
3 Robert 4 k l m n
See here.
Upvotes: 1