Rockbar
Rockbar

Reputation: 1161

Merge two dfs with duplicate rows, except in two columns

I have a panda data frame (after concatenating two dataframes) which has some duplicate rows, except two columns, where some row identifier are written. E.g.

  A      B C D E F
  Peter  1 c d e f
  Paula  2 g h i j
  Frank  3 c d e f
  Robert 4 k l m n
  Sarah  5 g h i j

use for testing:

df= pd.DataFrame({"A":["Peter", "Paula", "Frank", "Robert", "Sara"],
                  "B":[1,2,3,4,5],
                  "C":["c","g","c","k","g"],
                  "D":["d","h","d","l","h"],
                  "E":["e","i","e","m","i"],
                  "F":["f","j","f","n","j"]})

I want to keep only the first appearance of the duplicates in the colums C to F and keep the name and number (columns "A" and "B") of that rows. Thus, we would gain

  A      B C D E F
  Peter  1 c d e f
  Paula  2 g h i j
  Robert 4 k l m n

I tried something out with df.drop_duplicates, but that does not work with excluding rows "A" and "B". Also, when separating into two dataframes with A and B, and C to D, drop_duplicate, and merge afterwards via the index does not work, since drop_duplicates resets the index. So, how to achieve that? Thank you.

Upvotes: 0

Views: 361

Answers (1)

Spherical Cowboy
Spherical Cowboy

Reputation: 566

df2 = df.drop_duplicates(subset=["C", "D", "E", "F"])

Output:

        A  B  C  D  E  F
0   Peter  1  c  d  e  f
1   Paula  2  g  h  i  j
3  Robert  4  k  l  m  n

See here.

Upvotes: 1

Related Questions