liamod

Reputation: 326

Remove duplicates in a row based on column value

Hi, I couldn't find anything about this specifically; sorry if it's a duplicate.

How do I remove columns that contain the same information as another column (with some exceptions)?

Example:

      Name     Age     Job    How_Old    Occupation   Happy   Married?
 0    John     35      Dev    35         Dev          True    True
 1    Sally    42      CA     42         CA           False   False

I would like to drop columns with different names that contain the same information, except for columns where duplication is expected, such as binary (boolean) columns.

Output:

     Name     Age    Job   Happy    Married?
0    John     35     Dev   True     True
1    Sally    42     CA    False    False

Thanks. Also please note that I need to perform this operation on a massive flattened and normalised JSON file, so looping would be quite time-expensive.

Upvotes: 1

Views: 78

Answers (2)

Valdi_Bo

Reputation: 30971

Define the following function, returning a list of column names to be deleted:

def chkColToDel(df):
    # Column names excluding bool columns
    cols = df.select_dtypes(exclude=bool).columns.tolist()
    colsToDel = []
    while len(cols) > 1:
        cn1 = cols.pop(0)        # Column name, left side
        if cn1 not in colsToDel: # Not marked for deletion earlier
            c1 = df[cn1]         # The column itself
            t1 = c1.dtype.name   # Type name
            for cn2 in cols:     # Check remaining columns
                c2 = df[cn2]     # The column itself, right side
                if t1 == c2.dtype.name and c1.equals(c2):
                    # Same types and equal values
                    colsToDel.append(cn2) # Mark for deletion
    return colsToDel

Then call it:

colsToDel = chkColToDel(df)

And the only remaining thing is to drop the returned columns, if any:

if len(colsToDel) > 0:
    df.drop(columns=colsToDel, inplace=True)

I assume that the "some exceptions" mentioned in your post actually refer to bool columns. If the list of exceptions is broader, adjust my code accordingly.
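A quick sanity check on the sample data from the question (the function is repeated here, with comments trimmed, so the snippet runs standalone):

```python
import pandas as pd

def chkColToDel(df):
    # Candidate columns: everything except bool dtype
    cols = df.select_dtypes(exclude=bool).columns.tolist()
    colsToDel = []
    while len(cols) > 1:
        cn1 = cols.pop(0)         # Left-side column name
        if cn1 not in colsToDel:  # Skip if already marked for deletion
            c1 = df[cn1]
            t1 = c1.dtype.name
            for cn2 in cols:      # Compare against every remaining column
                c2 = df[cn2]
                if t1 == c2.dtype.name and c1.equals(c2):
                    colsToDel.append(cn2)
    return colsToDel

# Sample frame rebuilt from the question
df = pd.DataFrame({
    'Name': ['John', 'Sally'],
    'Age': [35, 42],
    'Job': ['Dev', 'CA'],
    'How_Old': [35, 42],
    'Occupation': ['Dev', 'CA'],
    'Happy': [True, False],
    'Married?': [True, False],
})

colsToDel = chkColToDel(df)   # ['How_Old', 'Occupation']
if len(colsToDel) > 0:
    df.drop(columns=colsToDel, inplace=True)
print(df.columns.tolist())    # ['Name', 'Age', 'Job', 'Happy', 'Married?']
```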

Upvotes: 0

jezrael

Reputation: 862581

First exclude boolean columns by DataFrame.select_dtypes, transpose, and get duplicates across all rows by DataFrame.duplicated. Then invert the mask by ~ and add back the removed boolean columns by Series.reindex. Last, filter by DataFrame.loc, with the first : selecting all rows and the mask selecting column names:

m = (~df.select_dtypes(exclude=bool).T.duplicated()).reindex(df.columns, fill_value=True)

Another idea is to convert each column to a tuple and call Series.duplicated:

m = ((~df.select_dtypes(exclude=bool).apply(tuple).duplicated())
         .reindex(df.columns, fill_value=True))

df = df.loc[:, m]
print (df)
    Name  Age  Job  Happy  Married?
0   John   35  Dev   True      True
1  Sally   42   CA  False     False

Details:

#exclude boolean columns
print (df.select_dtypes(exclude=bool))
    Name  Age  Job  How_Old Occupation
0   John   35  Dev       35        Dev
1  Sally   42   CA       42         CA

#transpose
print (df.select_dtypes(exclude=bool).T)
               0      1
Name        John  Sally
Age           35     42
Job          Dev     CA
How_Old       35     42
Occupation   Dev     CA

#check duplicated rows of the transposed frame (i.e. duplicated columns)
print (df.select_dtypes(exclude=bool).T.duplicated())
Name          False
Age           False
Job           False
How_Old        True
Occupation     True
dtype: bool

#invert mask True->False, False->True
print ((~df.select_dtypes(exclude=bool).T.duplicated()))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
dtype: bool

#add back the removed boolean columns as True
print ((~df.select_dtypes(exclude=bool).T.duplicated())
           .reindex(df.columns, fill_value=True))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
Happy          True
Married?       True
dtype: bool
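Putting the steps above together into one runnable sketch (the sample frame is rebuilt from the question; only the first mask variant is shown):

```python
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({
    'Name': ['John', 'Sally'],
    'Age': [35, 42],
    'Job': ['Dev', 'CA'],
    'How_Old': [35, 42],
    'Occupation': ['Dev', 'CA'],
    'Happy': [True, False],
    'Married?': [True, False],
})

# Boolean mask over column names: True = keep the column.
# Non-bool columns are transposed so duplicated() compares them as rows;
# reindex restores the bool columns with True (always kept).
m = (~df.select_dtypes(exclude=bool).T.duplicated()).reindex(df.columns, fill_value=True)

df = df.loc[:, m]
print(df.columns.tolist())   # ['Name', 'Age', 'Job', 'Happy', 'Married?']
```

Because everything here is vectorised at the column level (one duplicated() pass instead of pairwise Python loops), it should scale better on the large flattened JSON mentioned in the question.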

Upvotes: 3
