Reputation: 1434
Given a very large DataFrame with many columns and millions of rows, how would one extract a set of all values that match a given condition. For my use case, the condition is the value is in parentheses i.e., '(example)'
would be a value that should match the condition, but '(example'
shouldn't.
Minimal reproducible example:
df = pd.DataFrame({'A': ['1', '(2)', '3', '(4', '(5)'], 'B':['a', 'b', '(c)', 'd', 'e']})
Which gives:
A B
0 1 a
1 (2) b
2 3 (c)
3 (4 d
4 (5) e
Desired result:
{'(2)', '(5)', '(c)'}
The emphasis is on the scale of the DataFrame, the efficiency of the solution is the key metric here.
Upvotes: 1
Views: 852
Reputation: 260490
You can unstack to make your data 1D, filter using the '\(.*\)'
regex that matches anything surrounded by parentheses, and transform to set:
s = df.unstack()
set(s[s.str.match('\(.*\)')])
output:
{'(2)', '(5)', '(c)'}
NB. if you need to ensure that the parentheses are in the extremities, use '\(.*\)$'
as regex
Upvotes: 1
Reputation: 862601
If need match only values ()
and not ())
or (()
or something()anything
use Series.str.contains
with ^
for start of strings, \(.*\)
for any values in ()
and $
for end of string:
s = df.stack()
out = set(s[s.str.contains('^\(.*\)$')])
print (out)
{'(5)', '(2)', '(c)'}
Upvotes: 1