Rakesh Adhikesavan
Rakesh Adhikesavan

Reputation: 12826

How to work with duplicate entries in Pandas Data Frame?

I have a df with the following entries

ID            FIRST_NAME    FIRST_SUBJECT     SECOND_SUBJECT
A2035559    Sometsdf        Science
A2035559    Sometsdf        ENGINEERING
A20340619   Nsdsjes         MATHS
A20340619   Nsdsjes         SCIENCE

I want to identify the duplicate row by the column 'ID' and delete the duplicate row, but move the 'FIRST_SUBJECT' from the deleted row to the 'SECOND_SUBJECT' column of the original row. So that I have this.

ID            FIRST_NAME    FIRST_SUBJECT     SECOND_SUBJECT
A2035559    Sometsdf        Science            ENGINEERING
A20340619   Nsdsjes         MATHS              SCIENCE

This seems very tricky to me, I started by trying to sort the dataFrame by 'ID' but all my ID's start with 'A', so that did not work. How do I go about achieving this?

I had another idea that I am trying,

I create two copies of the dataFrame, df1 and df2. Since, I have only a maximum of one duplicate (ie. two copies of the same row), I drop duplicates by taking the last and the first in df1 and df2 respectively and then attempt to merge the two.

  df1 = df.drop_duplicates('ID' , take_last=False)
  df2 = df.drop_duplicates('ID' , take_last=True)

  df1['SECOND_SUBJECT'] = df2['FIRST_SUBJECT']

Would this work?

Upvotes: 2

Views: 246

Answers (1)

Robert Horvick
Robert Horvick

Reputation: 4036

I'm not really a Python or pandas developer so don't take this as the "right" way (I'm sure it's not - and I'm kind of skeptical of this approach of first/second subject as opposed to a more general mapping schema) - this won't scale to 3+ subjects well.

data = {
    'ID': ['A2035559', 'A20340619', 'A2035559', 'A20340619'],
    'FIRST_NAME': ['Sometsdf', 'Nsdsjes', 'Sometsdf', 'Nsdsjes'],
    'FIRST_SUBJECT': ['SCIENCE', 'MATHS', 'ENGINEERING', 'SCIENCE'],
    'SECOND_SUBJECT': [None, None, None, None]
}

d = pandas.DataFrame(data=data, columns=['ID', 'FIRST_NAME', 'FIRST_SUBJECT', 'SECOND_SUBJECT'])

dup_first = d.drop_duplicates(subset=['ID'], take_last=True)
dup_last = d.drop_duplicates(subset=['ID'], take_last=False)

for row in dup_last['ID']:
    dup_first.loc[dup_first['ID'] == row, 'SECOND_SUBJECT'] = dup_last.loc[dup_last['ID'] == row, 'FIRST_SUBJECT'].values[0]

dup_first

Produces

ID         FIRST_NAME  FIRST_SUBJECT  SECOND_SUBJECT
A2035559   Sometsdf    ENGINEERING    SCIENCE
A20340619  Nsdsjes     SCIENCE        MATHS

Upvotes: 1

Related Questions