Vivid

Reputation: 664

Most concise way to select rows where any column contains a string in Pandas dataframe?

What is the most concise way to select all rows where any column contains a string in a Pandas dataframe?

For example, given the following dataframe what is the best way to select those rows where the value in any column contains a b?

import pandas as pd

df = pd.DataFrame({
    'x': ['foo', 'foo', 'bar'],
    'y': ['foo', 'foo', 'foo'],
    'z': ['foo', 'baz', 'foo']
})

I'm inexperienced with Pandas and the best I've come up with so far is the rather cumbersome df[df.apply(lambda r: r.str.contains('b').any(), axis=1)]. Is there a simpler solution?

Critically, I want to check for a match in any column, not a particular column. Other similar questions, as best I can tell, only address a single column or a list of columns.
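For reference, on the example above the desired result is rows 1 and 2, which is what my current approach produces:

df[df.apply(lambda r: r.str.contains('b').any(), axis=1)]
     x    y    z
1  foo  foo  baz
2  bar  foo  foo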

Upvotes: 35

Views: 34439

Answers (3)

matt91t

Reputation: 171

If you don't like apply:

df.stack()[df.stack().str.contains("b")]

returns

1  z    baz
2  x    bar
dtype: object

and, to get a result shaped like the original table:

df.stack()[df.stack().str.contains("b")].reset_index().pivot(index="level_0", columns="level_1").droplevel(0, 1)
level_1    x    z
level_0          
1        NaN  baz
2        bar  NaN
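If you need the matching rows of the original frame rather than the stacked view, one way (just a small sketch building on the above) is to map the first level of the stacked index back onto df:

matches = df.stack()[df.stack().str.contains("b")]
df.loc[matches.index.get_level_values(0).unique()]
     x    y    z
1  foo  foo  baz
2  bar  foo  foo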

Upvotes: 0

wjandrea

Reputation: 33032

Your solution seems decent, but it'd be better to apply over columns instead of rows.

df[df.apply(lambda col: col.str.contains('b')).any(axis=1)]
     x    y    z
1  foo  foo  baz
2  bar  foo  foo

Why not rows?

Iterating over rows in Pandas is an anti-pattern. Here, it means that each row needs to be extracted and converted to a Series before being operated on. On the other hand, the columns are Series already.

For what it's worth, you might also see a speedup from using the vectorized string methods over a larger chunk of data at a time (assuming your real df is much longer than it is wide).

That said, I haven't done any testing of the speed myself.
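If you want to check it yourself, a rough timing sketch with timeit could look like this (the repetition count is arbitrary, and the frame is just the example df repeated to make it tall):

import timeit
import pandas as pd

df = pd.DataFrame({'x': ['foo', 'foo', 'bar'],
                   'y': ['foo', 'foo', 'foo'],
                   'z': ['foo', 'baz', 'foo']})

# Repeat the small example frame to get something tall enough to time.
big = pd.concat([df] * 10_000, ignore_index=True)

row_wise = lambda: big[big.apply(lambda r: r.str.contains('b').any(), axis=1)]
col_wise = lambda: big[big.apply(lambda col: col.str.contains('b')).any(axis=1)]

print(timeit.timeit(row_wise, number=3))  # row-wise apply
print(timeit.timeit(col_wise, number=3))  # column-wise apply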

Upvotes: 2

ihightower

Reputation: 3253

This question didn't get a formal answer, but the question itself and the comments already contain solutions, which worked really well for me. I didn't find these solutions anywhere else, so I'm copying them here.

Solution from @Reason (question):

the best I've come up with so far is the rather cumbersome

df[df.apply(lambda r: r.str.contains('b').any(), axis=1)]

This one worked for me.

Solution from @rbinnun (comment):

df[df.apply(lambda row: row.astype(str).str.contains('b').any(), axis=1)]

takes care of non-string columns, nans, etc.
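For example (my own made-up mixed-type frame, just to illustrate that point):

df2 = pd.DataFrame({'x': ['foo', 'bar', None],  # includes a missing value
                    'n': [1, 2, 3]})            # non-string column
df2[df2.apply(lambda row: row.astype(str).str.contains('b').any(), axis=1)]
     x  n
1  bar  2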

This one worked for me on a test dataset, but on a real dataset it returned the Unicode error shown below. Still, I think it's generally a good solution too.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 5: ordinal not in range(128)

Upvotes: 31
