ChrisOram
ChrisOram

Reputation: 1434

Get all values that match a condition over entire DataFrame

Given a very large DataFrame with many columns and millions of rows, how would one extract a set of all values that match a given condition. For my use case, the condition is the value is in parentheses i.e., '(example)' would be a value that should match the condition, but '(example' shouldn't.

Minimal reproducible example:

df = pd.DataFrame({'A': ['1', '(2)', '3', '(4', '(5)'], 'B':['a', 'b', '(c)', 'd', 'e']})

Which gives:

    A   B
0   1   a
1   (2) b
2   3   (c)
3   (4  d
4   (5) e

Desired result:

{'(2)', '(5)', '(c)'}

The emphasis is on the scale of the DataFrame, the efficiency of the solution is the key metric here.

Upvotes: 1

Views: 852

Answers (2)

mozway
mozway

Reputation: 260490

You can unstack to make your data 1D, filter using the '\(.*\)' regex that matches anything surrounded by parentheses, and transform to set:

s = df.unstack()
set(s[s.str.match('\(.*\)')])

output:

{'(2)', '(5)', '(c)'}

NB. if you need to ensure that the parentheses are in the extremities, use '\(.*\)$' as regex

Upvotes: 1

jezrael
jezrael

Reputation: 862601

If need match only values () and not ()) or (() or something()anything use Series.str.contains with ^ for start of strings, \(.*\) for any values in () and $ for end of string:

s = df.stack()
out = set(s[s.str.contains('^\(.*\)$')])
print (out)
{'(5)', '(2)', '(c)'}

Upvotes: 1

Related Questions