Reputation: 472
I have a Pandas DataFrame with some columns that have the same value in every row.
So, something like:
Col1   Col2   Col3  ...  ColX  ColY  ColZ
323    False  324   ...  4     abc   Sync
232    False  342   ...  4     def   Sync
364    False  2343  ...  4     ghi   Sync
So I would like to drop Col2, ColX and ColZ from the above DataFrame.
Upvotes: 3
Views: 4496
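For reference, the example frame from the question can be rebuilt like this (the column names and values are taken from the table above; the columns elided by "..." are omitted):

```python
import pandas as pd

# Reconstruct the question's sample DataFrame. Col2, ColX and ColZ
# hold the same value in every row and are the ones to drop.
df = pd.DataFrame({
    "Col1": [323, 232, 364],
    "Col2": [False, False, False],
    "Col3": [324, 342, 2343],
    "ColX": [4, 4, 4],
    "ColY": ["abc", "def", "ghi"],
    "ColZ": ["Sync", "Sync", "Sync"],
})
```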
Reputation: 2310
You can use nunique together with columns to obtain the names of the columns with more than one unique value:
In [6]: df[df.columns[df.nunique() > 1]]
Out[6]:
Col1 Col3 ColY
0 323 324 abc
1 232 342 def
2 364 2343 ghi
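Equivalently, if you prefer to drop the constant columns by name rather than select the varying ones, the same nunique mask can be fed to drop. A minimal sketch, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [323, 232, 364],
    "Col2": [False, False, False],
    "Col3": [324, 342, 2343],
    "ColX": [4, 4, 4],
    "ColY": ["abc", "def", "ghi"],
    "ColZ": ["Sync", "Sync", "Sync"],
})

# Columns with exactly one unique value are the constant ones.
constant_cols = df.columns[df.nunique() == 1]
result = df.drop(columns=constant_cols)
```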
Upvotes: 1
Reputation:
You can compare the DataFrame against a particular row (I chose the first one, with df.iloc[0]) and use loc to select the columns that satisfy the condition you specified:
df.loc[:, ~(df == df.iloc[0]).all()]
Out:
Col1 Col3 ColY
0 323 324 abc
1 232 342 def
2 364 2343 ghi
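One subtlety worth noting: the two approaches disagree on all-NaN columns, because NaN never compares equal to itself, while nunique ignores NaN by default. A minimal sketch (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [323, 232, 364],
    "AllNaN": [np.nan, np.nan, np.nan],
})

# Equality test: NaN != NaN, so the all-NaN column is never flagged
# as equal to the first row, and it survives the filter.
by_eq = df.loc[:, ~(df == df.iloc[0]).all()]

# nunique drops NaN by default, so an all-NaN column has
# nunique() == 0 and is removed by the `> 1` filter.
by_nunique = df[df.columns[df.nunique() > 1]]
```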
Timings:
@root's suggestion, nunique, is considerably faster than comparing the Series against a single value. Unless you have a huge number of columns (thousands, for example), iterating over the columns as @MMF suggested looks like the more efficient approach.
df = pd.concat([df]*10**5, ignore_index=True)
%timeit df.loc[:, ~(df == df.iloc[0]).all()]
1 loop, best of 3: 377 ms per loop
%timeit df[[col for col in df if not df[col].nunique()==1]]
10 loops, best of 3: 35.6 ms per loop
df = pd.concat([df]*10, axis=1, ignore_index=True)
%timeit df.loc[:, ~(df == df.iloc[0]).all()]
1 loop, best of 3: 3.71 s per loop
%timeit df[[col for col in df if not df[col].nunique()==1]]
1 loop, best of 3: 353 ms per loop
df = pd.concat([df]*3, axis=1, ignore_index=True)
%timeit df.loc[:, ~(df == df.iloc[0]).all()]
1 loop, best of 3: 11.3 s per loop
%timeit df[[col for col in df if not df[col].nunique()==1]]
1 loop, best of 3: 1.06 s per loop
Upvotes: 7
Reputation: 5921
You can also do this by checking the length of the set generated by the values of each column:
df = df[[col for col in df if not len(set(df[col]))==1]]
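A self-contained run of this approach on the question's sample data, for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [323, 232, 364],
    "Col2": [False, False, False],
    "Col3": [324, 342, 2343],
    "ColX": [4, 4, 4],
    "ColY": ["abc", "def", "ghi"],
    "ColZ": ["Sync", "Sync", "Sync"],
})

# Keep only the columns whose values collapse to a set of size > 1,
# i.e. columns that are not constant.
result = df[[col for col in df if not len(set(df[col])) == 1]]
```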
Upvotes: 6