Reputation: 1499
I am trying to obtain a list of columns in a DataFrame if any value in a column contains a string. For example in the below dataframe I would like a list of columns that have the % in the string. I am able to accomplish this using a for loop and the series.str.contains method but doens't seem optimal especially with a larger dataset. Is there a more efficient way to do this?
import pandas as pd
df = pd.DataFrame({'A': {0: '2019-06-01', 1: '2019-06-01', 2: '2019-06-01'},
'B': {0: '10', 1: '20', 2: '30'},
'C': {0: '10', 1: '20%', 2: '30%'},
'D': {0: '10%', 1: '20%', 2: '30'},
})
A B C D
0 2019-06-01 10 10 10%
1 2019-06-01 20 20% 20%
2 2019-06-01 30 30% 30
col_list = []
for col in df.columns:
if (True in list(df[col].str.contains('%'))) is True:
col_list.append(col)
['C', 'D']
Upvotes: 8
Views: 7370
Reputation: 11
Look at the below solution can work on object dtypes also.
[c for c in dfif df[c].eq("CSS").any()]
Upvotes: 0
Reputation: 294526
stack
with any
df.columns[df.stack().str.contains('%').any(level=1)]
Index(['C', 'D'], dtype='object')
[c for c in df if df[c].str.contains('%').any()]
['C', 'D']
filter
[*filter(lambda c: df[c].str.contains('%').any(), df)]
['C', 'D']
find
from numpy.core.defchararray import find
df.columns[(find(df.to_numpy().astype(str), '%') >= 0).any(0)]
Index(['C', 'D'], dtype='object')
Upvotes: 16
Reputation: 323386
Let us do melt
df.melt().loc[lambda x :x.value.str.contains('%'),'variable'].unique()
Out[556]: array(['C', 'D'], dtype=object)
Upvotes: 5
Reputation: 403148
Compare with replace
and create a mask to index the columns accordingly:
df.loc[:,(df != df.replace('%', '', regex=True)).any()]
C D
0 10 10%
1 20% 20%
2 30% 30
df.columns[(df != df.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')
This avoids the need for a loop, apply
, or applymap
.
Upvotes: 6
Reputation: 863621
First use DataFrame.select_dtypes
for filter only object columns, obviously string columns.
Then use DataFrame.applymap
for elementwise check values with DataFrame.any
for return True if at least one per column, so possible filter columns:
c = df.columns[df.select_dtypes(object).applymap(lambda x: '%' in str(x)).any()].tolist()
print (c)
['C', 'D']
Or use Series.str.contains
per columns, na
parameter should be omit if all strings columns:
f = lambda x: x.str.contains('%', na=False)
c = df.columns[df.select_dtypes(object).apply(f).any()].tolist()
print (c)
['C', 'D']
Upvotes: 10
Reputation: 150815
Try this:
df.columns[df.apply(lambda x: x.str.contains("\%")).any()]
Upvotes: 6