Reputation: 2726
How can I get a Boolean list indicating whether each group matched, when a positive lookahead is used for 'AND' inside one of the groups? I only want one Boolean returned per group.
Example:
I want to get a list of [True, True] returned for the string 'one two three'.
import re
[bool(x) for x in re.findall('(one)|((?=.*three)(?=.*two))', 'one two three')]
Gives: [True, True, True]
[bool(x) for x in re.findall('(one)(?=.*three)(?=.*two)', 'one two three')]
Gives: [True]
[bool(x) for x in re.findall('(one)|(?=.*three)(?=.*two)', 'one two three')]
Gives: [True, False, False]
I want [True, True]. That is, the second and final True is given when 'two' AND 'three' are both in the string, in any order.
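To see where the three results above come from, it can help to print what re.findall returns before converting to bool. This is only an illustrative sketch using the same patterns and string as above; the variable names are made up, and it assumes Python 3.7 or later, where an empty match may directly follow a non-empty one.

```python
import re

text = 'one two three'

# Two capture groups: findall returns one tuple per match, and a non-empty
# tuple is truthy even when every element in it is an empty string.
two_groups = re.findall('(one)|((?=.*three)(?=.*two))', text)

# One capture group: findall returns that group's text for each match, so the
# zero-width lookahead matches contribute empty (falsy) strings.
one_group = re.findall('(one)|(?=.*three)(?=.*two)', text)

print(two_groups)
print(one_group)
```

In both cases the lookahead branch matches once per starting position from which both words are still ahead, which is why the list length does not equal the number of groups.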
In plain language, I want a pattern that returns True (pattern found) or False (pattern not found) for every group in the pattern. I need logical ANDs inside groups so that the order of the patterns joined by AND does not matter: every pattern in the group must be found for the whole group to be labeled True.
So, using () as group indicators, for the "pattern" (one), (three AND two):
For the string 'one two three', I would get [True, True]
For the string 'one three two', I would get [True, True]
For the string 'two three one', I would get [True, True]
For the string 'one three ten', I would get [True, False]
For the string 'ten three two', I would get [False, True]
re.findall() or re.finditer() in Python, or pd.Series.str.extractall() in Pandas, returns something for each 'group'. Using one of those, I can use a regex OR, '|', to separate the groups and get something returned for each 'group' it "finds" (the string itself) or does "not find" (an empty string or NaN), which can then be converted into True or False.
For-loops can work, but my actual use case has hundreds of thousands of strings and several thousand search lists each with 10-20 patterns to loop through on each string. Completing these for-loops (for every string: for every pattern-list: for every pattern) is very slow. I am trying to combine the pattern-list into one pattern and get the same results.
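For reference, the slow loop-based approach described above could look roughly like the sketch below. The helper name match_groups and the groups list are hypothetical; the point is only to pin down the expected output that a single combined pattern should reproduce.

```python
import re

def match_groups(text, groups):
    """Return one bool per group; a group is True only if ALL its patterns are found."""
    # all() over re.search results gives the logical AND inside each group,
    # independent of the order in which the patterns occur in the text.
    return [all(re.search(p, text) for p in group) for group in groups]

# Hypothetical pattern list equivalent to (one), (three AND two)
groups = [['one'], ['three', 'two']]

strings = ['one two three', 'one three two', 'two three one',
           'one three ten', 'ten three two']
for s in strings:
    print(s, match_groups(s, groups))
```

This runs one search per pattern per string, which is exactly the cost the combined-pattern approach tries to avoid.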
I have this working using str.extractall() in Pandas. I just can't get the logical AND to work inside of a capture 'group'. That is the only thing I am stuck on and the basis of this question.
The Pandas code would be something like:
import pandas as pd
ser = pd.Series(['one two three'])
(~ser.str.extractall('(one)|(?=.*three)(?=.*two)').isna()).values.tolist()
Returns: [[True], [False], [False]], which could easily be collapsed into a list of bools rather than a list of lists. However, this has the same problems I showed above.
Upvotes: 1
Views: 929
Reputation: 2726
Avinash Raj's answer led me to the correct result, so I selected that answer. Specifically: in each group that joins several patterns with the 'AND' lookahead construct, put a named capture group around the first pattern only, and give every single-pattern group a named capture group as well.
A generalized example following my specific use case follows.
import pandas as pd
import numpy as np
regex_list = [['one'],['three','two'], ['four'], ['five', 'six', 'seven']]
def regex_single_make(regex_list):
    tmplist = []
    for n, l in enumerate(regex_list):
        if len(l) == 1:
            # Single pattern: one named capture group
            tmplist.append(r'(?P<_{}>\b{}\b)'.format(n, l[0]))
        else:
            # AND group: one lookahead per pattern; only the first is named
            tmplist.append(
                ''.join(
                    [r'(?=.*(?P<_{}>\b{}\b))'.format(n, v)
                     if k == 0
                     else r'(?=.*\b{}\b)'.format(v)
                     for k, v in enumerate(l)]))
    return '|'.join(tmplist)
regex_single = regex_single_make(regex_list)
regex_single
'(?P<_0>\\bone\\b)|(?=.*(?P<_1>\\bthree\\b))(?=.*\\btwo\\b)|(?P<_2>\\bfour\\b)|(?=.*(?P<_3>\\bfive\\b))(?=.*\\bsix\\b)(?=.*\\bseven\\b)'
b = pd.Series([
    'one two three four five six seven',
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'except four, no matching strings',
    'no_one, three AND two, no_four, five AND seven AND six'])
match_lists = (np.apply_along_axis(
    lambda vec: vec[[regex_list.index(x) for x in regex_list]], 1, (
        (~b.str.extractall(regex_single).isna())
        .reset_index()
        .groupby('level_0').agg('sum')
        .drop(columns='match')
        .reindex(range(b.size), fill_value=False)
        .values > 0)
    ).tolist())
match_lists
[[True, True, True, True],
[False, False, False, False],
[True, False, True, True],
[False, False, True, False],
[False, True, False, True]]
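The same idea can also be tried without Pandas. The sketch below is an assumption-laden illustration, not part of the code above: group_flags is a hypothetical helper that walks re.finditer matches and flags every named group (_0, _1, ...) that captured something, using the combined pattern shown earlier written out literally.

```python
import re

# Combined pattern as produced above, written out so the block is self-contained
pattern = (r'(?P<_0>\bone\b)|(?=.*(?P<_1>\bthree\b))(?=.*\btwo\b)'
           r'|(?P<_2>\bfour\b)|(?=.*(?P<_3>\bfive\b))(?=.*\bsix\b)(?=.*\bseven\b)')

def group_flags(pattern, text, n_groups):
    """Return one bool per named group _0.._{n-1}: True if it captured in any match."""
    flags = [False] * n_groups
    for m in re.finditer(pattern, text):
        for name, value in m.groupdict().items():
            if value is not None:
                # Group names look like '_3'; the digits give the list position
                flags[int(name[1:])] = True
    return flags

texts = [
    'one two three four five six seven',
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'except four, no matching strings',
    'no_one, three AND two, no_four, five AND seven AND six']
for t in texts:
    print(group_flags(pattern, t, 4))
```

A named group inside a lookahead only keeps its capture when the whole alternative succeeds, which is what makes the single name per AND group sufficient.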
Upvotes: 0
Reputation: 174716
We can solve this with named capturing groups. I separated the pattern into two parts, then check whether the first and second parts exist: if a part is found, return True for it, else return False.
>>> import re
>>> def findstr(x):
        first = second = False
        # 'second' sits inside a lookahead; it only captures if 'two' is found as well
        matches = re.finditer(r'(?P<first>one)|(?=.*(?P<second>three))(?=.*two)', x)
        for match in matches:
            if match.group('first'):
                first = True
            elif match.group('second'):
                second = True
        return [first, second]
>>> str_lst = ['one two three', 'one three two', 'two three one', 'one three ten', 'ten three two']
>>> for stri in str_lst:
        print(findstr(stri))
[True, True]
[True, True]
[True, True]
[True, False]
[False, True]
Note that the second group gets captured only if both two and three exist in the string.
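To make that note concrete, here is a small sketch that prints what each match captured for two of the test strings. It reuses the pattern from the answer; the variable names are made up for illustration.

```python
import re

pattern = r'(?P<first>one)|(?=.*(?P<second>three))(?=.*two)'

# 'three' and 'two' both present: the lookahead branch matches (zero-width),
# once for every start position from which both words are still ahead.
both = [m.groupdict() for m in re.finditer(pattern, 'ten three two')]

# 'three' present but 'two' missing: the lookahead branch fails as a whole,
# so 'second' is never reported; only the 'one' branch matches.
only_three = [m.groupdict() for m in re.finditer(pattern, 'one three ten')]

print(both)
print(only_three)
```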
Upvotes: 1
Reputation: 1296
The following line uses re.finditer instead of re.findall. The regex also needs a .+ at the end so that the lookahead branch consumes the rest of the string and yields a single match when both two and three are present, no matter the order.
[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three).+', 'one three two')]
This also works for one three two four, as mentioned in one of the OP's comments, without having to declare all possible permutations.
[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three)(?=.*four).+', 'one two four three')]
Upvotes: 0
Reputation: 27723
My guess is that you wish to design some expression similar to:
[bool(x) for x in re.findall(r'^(?:one\b.*?)\b(two|three)\b|\b(three|two)\b.*$', 'one three two')]
I'm not sure, though. Or maybe:
search = ['two','three']
string_to_search = 'one two three'
output = []
for word in search:
    if word in string_to_search:
        output.append(True)
print(output)
[True, True]
Upvotes: 1