Reputation: 435
I would like to remove columns from my dataframe that duplicate the values of another column in every row.
I have a dataframe like this:
import pandas as pd

test = [('a', 1, 'a', 34, 'b', 34, 'a'),
        ('a', 1, 'a', 30, 'v', 30, 'a'),
        ('a', 1, 'a', 16, 'a', 16, 'a'),
        ('a', 1, 'a', 30, 'a', 30, 'a'),
        ('a', 1, 'a', 30, 'v', 30, 'a'),
        ('a', 1, 'a', 30, 'd', 30, 'a'),
        ('a', 1, 'a', 40, 'a', 40, 'a'),
        ('a', 1, 'a', 30, 'a', 30, 'a')
        ]
test_df = pd.DataFrame(test, columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7'])
As you can see, columns col1, col3, and col7 hold the same values in every row, as do col4 and col6. My expected output is a dataframe without the duplicate columns. To be more precise, I would like to keep only one column from each group of duplicates, e.g. col1 and col4.
Upvotes: 1
Views: 112
Reputation: 862406
First transpose, then drop duplicate rows, and finally transpose back:
test_df = test_df.T.drop_duplicates().T
print(test_df)
col1 col2 col4 col5
0 a 1 34 b
1 a 1 30 v
2 a 1 16 a
3 a 1 30 a
4 a 1 30 v
5 a 1 30 d
6 a 1 40 a
7 a 1 30 a
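One caveat worth noting (my observation, not part of the original answer): transposing a mixed-dtype frame upcasts everything to object, and transposing back does not restore the original dtypes. infer_objects can recover them afterwards. A minimal sketch with a made-up frame:

```python
import pandas as pd

# small mixed-dtype frame: string columns plus an integer column
df = pd.DataFrame({'col1': ['a', 'a'], 'col2': [1, 1], 'col3': ['b', 'c']})

out = df.T.drop_duplicates().T
print(out.dtypes)          # every column is object after the round trip

out = out.infer_objects()  # re-infer better dtypes where possible
print(out['col2'].dtype)   # int64 again
```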
Another solution is to convert each column to a tuple, call Series.duplicated, invert the mask with ~, and filter the columns with DataFrame.loc and boolean indexing:
test_df = test_df.loc[:, ~test_df.apply(tuple).duplicated()]
Upvotes: 2