Jitesh
Jitesh

Reputation: 195

dropping duplicate rows from data frame excluding last two columns python pandas

I have data like this:

id    city  age  gender  Child  Generation_group   Bodytag   RecordType 
2001   A    2      M      0      ABCD               WW        AW
2002   A    2      M      0      ABCD               AA        WA
2002   A    2      M      0      ABCD               BB        AW
2002   A    2      M      0      ABCD               CC        WA
2004   A    2      M      0      ABCD               DD        AW
2004   A    2      M      0      ABCD               EE        WA
2005   A    2      M      0      ABCD               FF        AW
2005   A    2      M      0      ABCD               QQ        WA

I want output like this:

id    city  age  gender  Child  Generation_group   Bodytag   RecordType 
2001   A    2      M      0      ABCD               WW        AW
2002   A    2      M      0      ABCD               AA        WA
2002   Null Null   Null   Null   Null               BB        AW
2002   Null Null   Null   Null   Null               CC        WA
2004   A    2      M      0      ABCD               DD        AW
2004   Null Null   Null   Null   Null               EE        WA
2005   A    2      M      0      ABCD               FF        AW
2005   Null Null   Null   Null   Null               QQ        WA

I want null values where duplicate ids are there but only in first 5 columns after id columns , I don't want null in last two columns

How do I do it in pandas ?

Upvotes: 1

Views: 126

Answers (2)

jezrael
jezrael

Reputation: 863226

You can use mask with duplicated and select columns via []:

cols = df.columns
df[cols[1:-2]] = df[cols[1:-2]].mask(df[cols[:-2]].duplicated())
print (df)
     id city  age gender  Child Generation_group Bodytag RecordType
0  2001    A  2.0      M    0.0             ABCD      WW         AW
1  2002    A  2.0      M    0.0             ABCD      AA         WA
2  2002  NaN  NaN    NaN    NaN              NaN      BB         AW
3  2002  NaN  NaN    NaN    NaN              NaN      CC         WA
4  2004    A  2.0      M    0.0             ABCD      DD         AW
5  2004  NaN  NaN    NaN    NaN              NaN      EE         WA
6  2005    A  2.0      M    0.0             ABCD      FF         AW
7  2005  NaN  NaN    NaN    NaN              NaN      QQ         WA

Similar solution:

cols = df.columns
df[cols[1:-2]] = df[cols[1:-2]].mask(df.duplicated(subset=cols[:-2]))
print (df)
     id city  age gender  Child Generation_group Bodytag RecordType
0  2001    A  2.0      M    0.0             ABCD      WW         AW
1  2002    A  2.0      M    0.0             ABCD      AA         WA
2  2002  NaN  NaN    NaN    NaN              NaN      BB         AW
3  2002  NaN  NaN    NaN    NaN              NaN      CC         WA
4  2004    A  2.0      M    0.0             ABCD      DD         AW
5  2004  NaN  NaN    NaN    NaN              NaN      EE         WA
6  2005    A  2.0      M    0.0             ABCD      FF         AW
7  2005  NaN  NaN    NaN    NaN              NaN      QQ         WA

Solution with iloc:

df.iloc[:, 1:-2] = df.iloc[:, 1:-2].mask(df.iloc[:, :-2].duplicated())
print (df)
     id city  age gender  Child Generation_group Bodytag RecordType
0  2001    A  2.0      M    0.0             ABCD      WW         AW
1  2002    A  2.0      M    0.0             ABCD      AA         WA
2  2002  NaN  NaN    NaN    NaN              NaN      BB         AW
3  2002  NaN  NaN    NaN    NaN              NaN      CC         WA
4  2004    A  2.0      M    0.0             ABCD      DD         AW
5  2004  NaN  NaN    NaN    NaN              NaN      EE         WA
6  2005    A  2.0      M    0.0             ABCD      FF         AW
7  2005  NaN  NaN    NaN    NaN              NaN      QQ         WA

EDIT:

If want check for duplicates only first column id change cols[:-2] to id and df.iloc[:, :-2].duplicated() to df.duplicated(subset=['id'])

Upvotes: 1

unutbu
unutbu

Reputation: 880339

You could use df.duplicated to find the duplicate id's, and df.loc to select the sub-DataFrame. Notice that df.loc can take 2 indexers -- df.duplicated('id', keep='first') indexes the rows by boolean selection, and df.columns[1:6] selects the columns by label:

In [13]: df.loc[df.duplicated('id', keep='first'), df.columns[1:6]] = np.nan

In [14]: df
Out[14]: 
     id city  age gender  Child Generation_group Bodytag RecordType
0  2001    A  2.0      M    0.0             ABCD      WW         AW
1  2002    A  2.0      M    0.0             ABCD      AA         WA
2  2002  NaN  NaN    NaN    NaN              NaN      BB         AW
3  2002  NaN  NaN    NaN    NaN              NaN      CC         WA
4  2004    A  2.0      M    0.0             ABCD      DD         AW
5  2004  NaN  NaN    NaN    NaN              NaN      EE         WA
6  2005    A  2.0      M    0.0             ABCD      FF         AW
7  2005  NaN  NaN    NaN    NaN              NaN      QQ         WA

Upvotes: 0

Related Questions