Get all values that match a condition over entire DataFrame

Question

Given a very large DataFrame with many columns and millions of rows, how would one extract the set of all values that match a given condition? For my use case, the condition is that the value is enclosed in parentheses, i.e. '(example)' is a value that should match the condition, but '(example' is not.

Minimal reproducible example:

import pandas as pd
df = pd.DataFrame({'A': ['1', '(2)', '3', '(4', '(5)'], 'B': ['a', 'b', '(c)', 'd', 'e']})

Which gives:

     A    B
0    1    a
1  (2)    b
2    3  (c)
3   (4    d
4  (5)    e

Desired result:

{'(2)', '(5)', '(c)'}

The emphasis is on the scale of the DataFrame, so the efficiency of the solution is the key metric here.

jezrael · Accepted Answer

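If you need to match only values that are wrapped entirely in parentheses (and not values such as something()anything), use Series.str.contains with ^ for the start of the string, \(.*\) for any characters between a literal ( and ), and $ for the end of the string:

# Reshape all columns into a single Series of values
s = df.stack()
# Keep values that start with '(' and end with ')', then deduplicate
out = set(s[s.str.contains(r'^\(.*\)$')])
print(out)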
{'(5)', '(2)', '(c)'}
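
Since efficiency is the stated concern, one possible variation (a sketch, not part of the original answer and not benchmarked here) is to deduplicate the values before applying the regex, so the pattern is evaluated once per distinct value rather than once per cell. This assumes every column holds strings; `na=False` treats missing or non-string entries as non-matches. The character class `[^()]*` is a stricter choice than `.*`: a pattern with `.*` between the parentheses would also accept values such as '(()' or '())', while `[^()]*` rejects them.

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '(2)', '3', '(4', '(5)'],
                   'B': ['a', 'b', '(c)', 'd', 'e']})

# Flatten the frame to a 1-D array and drop duplicates first,
# so the regex runs only on distinct values
unique_vals = pd.Series(pd.unique(df.to_numpy().ravel()))

# fullmatch anchors the pattern at both ends, so ^ and $ are not needed;
# [^()]* forbids any further parentheses inside the outer pair
mask = unique_vals.str.fullmatch(r'\([^()]*\)', na=False)

out = set(unique_vals[mask])
print(out)
```

Whether this is faster than stacking depends on how many repeated values the data contains, so it is worth timing both on a representative sample.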
