Reputation: 1499

Select columns if any of their rows contain a certain string

I am trying to obtain a list of columns in a DataFrame if any value in a column contains a string. For example in the below dataframe I would like a list of columns that have the % in the string. I am able to accomplish this using a for loop and the series.str.contains method but doens't seem optimal especially with a larger dataset. Is there a more efficient way to do this?

import pandas as pd

df = pd.DataFrame({'A': {0: '2019-06-01', 1: '2019-06-01', 2: '2019-06-01'},
                   'B': {0: '10', 1: '20', 2: '30'},
                   'C': {0: '10', 1: '20%', 2: '30%'},
                   'D': {0: '10%', 1: '20%', 2: '30'},
               })

DataFrame

            A   B    C    D
0  2019-06-01  10   10  10%
1  2019-06-01  20  20%  20%
2  2019-06-01  30  30%   30

Current Method

col_list = []
for col in df.columns:
    if (True in list(df[col].str.contains('%'))) is True:
        col_list.append(col)

Output

['C', 'D']

Upvotes: 8

Answers (6)

chetan wankhede

Reputation: 11

Look at the below solution can work on object dtypes also.

[c for c in dfif df[c].eq("CSS").any()]

Upvotes: 0

piRSquared

Reputation: 294526

`stack` with `any`

df.columns[df.stack().str.contains('%').any(level=1)]

Index(['C', 'D'], dtype='object')

comprehension

[c for c in df if df[c].str.contains('%').any()]

['C', 'D']

`filter`

[*filter(lambda c: df[c].str.contains('%').any(), df)]

['C', 'D']

Numpy's `find`

from numpy.core.defchararray import find

df.columns[(find(df.to_numpy().astype(str), '%') >= 0).any(0)]

Index(['C', 'D'], dtype='object')

Upvotes: 16

BENY

Reputation: 323386

Let us do melt

df.melt().loc[lambda x :x.value.str.contains('%'),'variable'].unique()
Out[556]: array(['C', 'D'], dtype=object)

Upvotes: 5

cs95

Reputation: 403148

Compare with replace and create a mask to index the columns accordingly:

df.loc[:,(df != df.replace('%', '', regex=True)).any()]
     C    D
0   10  10%
1  20%  20%
2  30%   30

df.columns[(df != df.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')

This avoids the need for a loop, apply, or applymap.

Upvotes: 6

jezrael

Reputation: 863621

First use DataFrame.select_dtypes for filter only object columns, obviously string columns.

Then use DataFrame.applymap for elementwise check values with DataFrame.any for return True if at least one per column, so possible filter columns:

c = df.columns[df.select_dtypes(object).applymap(lambda x: '%' in str(x)).any()].tolist()
print (c)
['C', 'D']

Or use Series.str.contains per columns, na parameter should be omit if all strings columns:

f = lambda x: x.str.contains('%', na=False)
c = df.columns[df.select_dtypes(object).apply(f).any()].tolist()
print (c)
['C', 'D']

Upvotes: 10

Quang Hoang

Reputation: 150815

Try this:

df.columns[df.apply(lambda x: x.str.contains("\%")).any()]