Reputation: 664
What is the most concise way to select all rows where any column contains a string in a Pandas dataframe?
For example, given the following dataframe, what is the best way to select those rows where the value in any column contains a b?
df = pd.DataFrame({
'x': ['foo', 'foo', 'bar'],
'y': ['foo', 'foo', 'foo'],
'z': ['foo', 'baz', 'foo']
})
I'm inexperienced with Pandas and the best I've come up with so far is the rather cumbersome df[df.apply(lambda r: r.str.contains('b').any(), axis=1)]. Is there a simpler solution?
Critically, I want to check for a match in any columns, not a particular column. Other similar questions, as best I can tell, only address a single or list of columns.
Upvotes: 35
Views: 34439
Reputation: 171
If you don't like apply:
df.stack()[df.stack().str.contains("b")]
returns
1 z baz
2 x bar
dtype: object
and, to recover something close to the original table's layout:
df.stack()[df.stack().str.contains("b")].reset_index().pivot(index="level_0", columns="level_1").droplevel(0, 1)
level_1 x z
level_0
1 NaN baz
2 bar NaN
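If you want the full original rows rather than a reshaped table, a simpler route (my own sketch, not from the answer above) is to take the first level of the stacked index, which holds the original row labels, and use it to select from the original frame:

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['foo', 'foo', 'bar'],
    'y': ['foo', 'foo', 'foo'],
    'z': ['foo', 'baz', 'foo']
})

# Stack to a Series indexed by (row, column), keep entries containing "b",
# then use the row level of the index to pull matching rows out of df.
stacked = df.stack()
matches = stacked[stacked.str.contains('b')]
rows = matches.index.get_level_values(0).unique()
result = df.loc[rows]
```

This keeps every column of the matching rows intact instead of leaving NaNs in the non-matching cells.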
Upvotes: 0
Reputation: 33032
Your solution seems decent, but it'd be better to apply over columns instead of rows.
df[df.apply(lambda col: col.str.contains('b')).any(axis=1)]
x y z
1 foo foo baz
2 bar foo foo
Iterating over rows in Pandas is an anti-pattern. Here, it means that each row needs to be extracted and converted to a Series before being operated on. On the other hand, the columns are Series already.
For what it's worth, you might also see a speedup from using the vectorized string methods over a larger chunk of data at a time (assuming your real df is much longer than it is wide).
That said, I haven't done any testing of the speed myself.
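For illustration, apply can be avoided entirely by building one boolean mask per column with the vectorized .str.contains and OR-ing the masks together. This is my own sketch of that idea, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': ['foo', 'foo', 'bar'],
    'y': ['foo', 'foo', 'foo'],
    'z': ['foo', 'baz', 'foo']
})

# One vectorized boolean mask per column, combined with a logical OR
# across columns: True for any row where some column contains 'b'.
mask = np.logical_or.reduce([df[c].str.contains('b') for c in df.columns])
result = df[mask]
```

The list comprehension still loops over columns, but each .str.contains call is vectorized over the whole column, which is usually the dominant cost when the frame is much longer than it is wide.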
Upvotes: 2
Reputation: 3253
This question was never given an accepted answer, but the question itself and the comments already contain solutions that worked well for me. I didn't find these solutions anywhere else, so I've copied them here.
the best I've come up with so far is the rather cumbersome
df[df.apply(lambda r: r.str.contains('b').any(), axis=1)]
This one worked for me:
df[df.apply(lambda row: row.astype(str).str.contains('b').any(), axis=1)]
It takes care of non-string columns, NaNs, etc.
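To show why the astype(str) step matters, here is a small sketch with a hypothetical frame mixing strings, numbers, and a NaN. Without the conversion, .str.contains would fail on the numeric column; with it, every cell is searched as text (note that NaN becomes the literal string 'nan', so searching for 'a' or 'n' would match those cells):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type dataframe for illustration.
df = pd.DataFrame({
    'x': ['foo', np.nan, 'bar'],
    'y': [1, 2, 3],
})

# astype(str) converts every cell to a string ('nan' for NaN, '1' for 1),
# so .str.contains never encounters a non-string value.
result = df[df.apply(lambda row: row.astype(str).str.contains('b').any(), axis=1)]
```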
This one also worked for me on a test dataset, but on a real dataset it raised the Unicode error below. Still, I think it's generally a good solution too:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 5: ordinal not in range(128)
Upvotes: 31