data_b77

Reputation: 435

How to remove columns with duplicate values in all rows in pandas

I would like to remove columns from my dataframe whose values are duplicated by another column in every row.

I've got a dataframe like this:

import pandas as pd

test = [('a', 1, 'a', 34, 'b', 34, 'a'),
        ('a', 1, 'a', 30, 'v', 30, 'a'),
        ('a', 1, 'a', 16, 'a', 16, 'a'),
        ('a', 1, 'a', 30, 'a', 30, 'a'),
        ('a', 1, 'a', 30, 'v', 30, 'a'),
        ('a', 1, 'a', 30, 'd', 30, 'a'),
        ('a', 1, 'a', 40, 'a', 40, 'a'),
        ('a', 1, 'a', 30, 'a', 30, 'a')]
test_df = pd.DataFrame(test, columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7'])

As you can see, columns col1, col3, col4, col6 and col7 contain duplicated values across all rows, and my expected output should be a dataframe without those duplicate columns. To be more precise, I would like to keep only one column from each group of duplicates, e.g. col1 and col4.

Upvotes: 1

Views: 112

Answers (1)

jezrael

Reputation: 862406

First transpose, then drop the duplicated rows with DataFrame.drop_duplicates, and finally transpose back:

test_df = test_df.T.drop_duplicates().T
print (test_df)
  col1 col2 col4 col5
0    a    1   34    b
1    a    1   30    v
2    a    1   16    a
3    a    1   30    a
4    a    1   30    v
5    a    1   30    d
6    a    1   40    a
7    a    1   30    a
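
One caveat: the double transpose upcasts every column to object dtype, because the transposed rows mix strings and integers. A soft conversion with DataFrame.infer_objects should restore the numeric columns (a minimal sketch, applied to the result above):

print (test_df.dtypes)
# col1    object
# col2    object
# col4    object
# col5    object
# dtype: object

# soft-convert the numeric columns back to int64
print (test_df.infer_objects().dtypes)
# col1    object
# col2     int64
# col4     int64
# col5    object
# dtype: object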

Another solution is to convert each column to a tuple, call Series.duplicated, and filter with DataFrame.loc and boolean indexing, inverting the mask with ~:

test_df = test_df.loc[:, ~test_df.apply(tuple).duplicated()]
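
For reference, on the freshly built sample dataframe this should keep the same four columns as the first solution, and since nothing is transposed the integer dtypes of col2 and col4 should stay intact:

print (test_df.columns.tolist())
# ['col1', 'col2', 'col4', 'col5']

print (test_df.dtypes)
# col1    object
# col2     int64
# col4     int64
# col5    object
# dtype: object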

Upvotes: 2
