Reputation: 195
I have data like this:
id city age gender Child Generation_group Bodytag RecordType
2001 A 2 M 0 ABCD WW AW
2002 A 2 M 0 ABCD AA WA
2002 A 2 M 0 ABCD BB AW
2002 A 2 M 0 ABCD CC WA
2004 A 2 M 0 ABCD DD AW
2004 A 2 M 0 ABCD EE WA
2005 A 2 M 0 ABCD FF AW
2005 A 2 M 0 ABCD QQ WA
I want output like this:
id city age gender Child Generation_group Bodytag RecordType
2001 A 2 M 0 ABCD WW AW
2002 A 2 M 0 ABCD AA WA
2002 Null Null Null Null Null BB AW
2002 Null Null Null Null Null CC WA
2004 A 2 M 0 ABCD DD AW
2004 Null Null Null Null Null EE WA
2005 A 2 M 0 ABCD FF AW
2005 Null Null Null Null Null QQ WA
I want null values where duplicate ids are there but only in first 5 columns after id columns , I don't want null in last two columns
How do I do it in pandas ?
Upvotes: 1
Views: 126
Reputation: 863226
You can use mask
with duplicated
and select columns via []
:
cols = df.columns
df[cols[1:-2]] = df[cols[1:-2]].mask(df[cols[:-2]].duplicated())
print (df)
id city age gender Child Generation_group Bodytag RecordType
0 2001 A 2.0 M 0.0 ABCD WW AW
1 2002 A 2.0 M 0.0 ABCD AA WA
2 2002 NaN NaN NaN NaN NaN BB AW
3 2002 NaN NaN NaN NaN NaN CC WA
4 2004 A 2.0 M 0.0 ABCD DD AW
5 2004 NaN NaN NaN NaN NaN EE WA
6 2005 A 2.0 M 0.0 ABCD FF AW
7 2005 NaN NaN NaN NaN NaN QQ WA
Similar solution:
cols = df.columns
df[cols[1:-2]] = df[cols[1:-2]].mask(df.duplicated(subset=cols[:-2]))
print (df)
id city age gender Child Generation_group Bodytag RecordType
0 2001 A 2.0 M 0.0 ABCD WW AW
1 2002 A 2.0 M 0.0 ABCD AA WA
2 2002 NaN NaN NaN NaN NaN BB AW
3 2002 NaN NaN NaN NaN NaN CC WA
4 2004 A 2.0 M 0.0 ABCD DD AW
5 2004 NaN NaN NaN NaN NaN EE WA
6 2005 A 2.0 M 0.0 ABCD FF AW
7 2005 NaN NaN NaN NaN NaN QQ WA
Solution with iloc
:
df.iloc[:, 1:-2] = df.iloc[:, 1:-2].mask(df.iloc[:, :-2].duplicated())
print (df)
id city age gender Child Generation_group Bodytag RecordType
0 2001 A 2.0 M 0.0 ABCD WW AW
1 2002 A 2.0 M 0.0 ABCD AA WA
2 2002 NaN NaN NaN NaN NaN BB AW
3 2002 NaN NaN NaN NaN NaN CC WA
4 2004 A 2.0 M 0.0 ABCD DD AW
5 2004 NaN NaN NaN NaN NaN EE WA
6 2005 A 2.0 M 0.0 ABCD FF AW
7 2005 NaN NaN NaN NaN NaN QQ WA
EDIT:
If want check for duplicates only first column id
change cols[:-2]
to id
and df.iloc[:, :-2].duplicated()
to df.duplicated(subset=['id'])
Upvotes: 1
Reputation: 880339
You could use df.duplicated
to find the duplicate id's, and df.loc
to select the sub-DataFrame. Notice that df.loc
can take 2 indexers -- df.duplicated('id', keep='first')
indexes the rows by boolean selection, and df.columns[1:6]
selects the columns by label:
In [13]: df.loc[df.duplicated('id', keep='first'), df.columns[1:6]] = np.nan
In [14]: df
Out[14]:
id city age gender Child Generation_group Bodytag RecordType
0 2001 A 2.0 M 0.0 ABCD WW AW
1 2002 A 2.0 M 0.0 ABCD AA WA
2 2002 NaN NaN NaN NaN NaN BB AW
3 2002 NaN NaN NaN NaN NaN CC WA
4 2004 A 2.0 M 0.0 ABCD DD AW
5 2004 NaN NaN NaN NaN NaN EE WA
6 2005 A 2.0 M 0.0 ABCD FF AW
7 2005 NaN NaN NaN NaN NaN QQ WA
Upvotes: 0