Reputation: 2726

Regex with multiple groups that use lookahead for logical AND inside regex group

How can I get a Boolean list that indicates whether each group matched, when a positive lookahead is used for 'AND' inside one of the groups? I only want one Boolean returned for each group.

Example: I want to get a list of [True, True] returned for the following string 'one two three'.

[bool(x) for x in re.findall('(one)|((?=.*three)(?=.*two))', 'one two three')]

Gives: [True, True, True]

[bool(x) for x in re.findall('(one)(?=.*three)(?=.*two)', 'one two three')]

Gives: [True]

[bool(x) for x in re.findall('(one)|(?=.*three)(?=.*two)', 'one two three')]

Gives: [True, False, False]

I want [True, True]

That is, the second and final True is given when 'two' AND 'three' are in the string in any order.
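
For reference, printing the raw re.findall results instead of their bool() values shows where the three lists above come from. This is only a sketch of those same three attempts; the variable names are illustrative. With two capture groups, findall returns one tuple per match, and a non-empty tuple is truthy even when every element is an empty string. With one capture group it returns that group's text for each match, so the zero-width lookahead matches show up as empty strings.

```python
import re

text = 'one two three'

# Two capture groups: one tuple per match.
two_groups = re.findall('(one)|((?=.*three)(?=.*two))', text)
print(two_groups)   # [('one', ''), ('', ''), ('', '')]

# One capture group with the lookaheads required right after it: a single match.
chained = re.findall('(one)(?=.*three)(?=.*two)', text)
print(chained)      # ['one']

# One capture group, with the lookaheads as a separate alternative.
alternated = re.findall('(one)|(?=.*three)(?=.*two)', text)
print(alternated)   # ['one', '', '']
```

In each case the list has one entry per match found while scanning the string, not one entry per group in the pattern.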

Edit for clarification:

In plain language, I want a pattern that can return True (pattern found) or False (pattern not found) for every group in the pattern. I need to use logical ANDs inside groups so that the order of the patterns joined by AND inside a group does not matter; every pattern in the group must be found for the whole group to be labeled True.

So, using () as group indicators, the "pattern" would be (one), (three AND two).

For the string 'one two three', I would get [True, True]
For the string 'one three two', I would get [True, True]
For the string 'two three one', I would get [True, True]
For the string 'one three ten', I would get [True, False]
For the string 'ten three two', I would get [False, True]

The re.findall() or re.finditer() functions in Python, or pd.Series.str.extractall() in Pandas, return something for each 'group'. Using one of those, I can use a regex OR, '|', to separate the groups and get something returned for each 'group': the string itself when it is found, or an empty string or NaN when it is not, which can then be converted into True or False.

For-loops can work, but my actual use case has hundreds of thousands of strings and several thousand search lists each with 10-20 patterns to loop through on each string. Completing these for-loops (for every string: for every pattern-list: for every pattern) is very slow. I am trying to combine the pattern-list into one pattern and get the same results.
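
The nested loops described above could be sketched roughly as follows. The function name, the sample strings, and the single pattern-list are illustrative only; the real data is much larger. It can also serve as a slow reference to check a combined pattern against.

```python
import re

strings = ['one two three', 'one three ten', 'ten three two']
pattern_list = [['one'], ['three', 'two']]  # one inner list per group

def match_flags(text, groups):
    """Return one Boolean per group: True only if every pattern in it is found."""
    return [all(re.search(p, text) for p in group) for group in groups]

# for every string: for every group: for every pattern
results = [match_flags(s, pattern_list) for s in strings]
print(results)  # [[True, True], [True, False], [False, True]]
```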

I have this working using str.extractall() in Pandas. I just can't get the logical AND to work inside of a capture 'group'. That is the only thing I am stuck on and the basis of this question.

The Pandas code would be something like:

import pandas as pd
ser = pd.Series(['one two three']) 
(~ser.str.extractall('(one)|(?=.*three)(?=.*two)').isna()).values.tolist()

Returns: [[True], [False], [False]]. This could easily be collapsed into a list of bools rather than a list of lists, but it has the same problem I showed above.

Upvotes: 1

Answers (4)

Clay

Reputation: 2726

Avinash Raj's answer led me to the correct result, so I selected that answer. Specifically, what worked was naming the first pattern in each group whose patterns are joined by the 'AND' lookahead construct, and naming every stand-alone pattern.

A generalized example following my specific use case follows.

import pandas as pd
import numpy as np

regex_list = [['one'],['three','two'], ['four'], ['five', 'six', 'seven']]

def regex_single_make(regex_list):
    tmplist = []
    for n,l in enumerate(regex_list):
        if len(l) == 1:
            tmplist.append(r'(?P<_{}>\b{}\b)'.format(n, l[0]))
        else:
            tmplist.append(
                ''.join(
                    [r'(?=.*(?P<_{}>\b{}\b))'.format(n, v)
                    if k == 0 
                    else r'(?=.*\b{}\b)'.format(v)
                    for k,v in enumerate(l)]))
    return '|'.join(tmplist)

regex_single = regex_single_make(regex_list)

regex_single

'(?P<_0>\\bone\\b)|(?=.*(?P<_1>\\bthree\\b))(?=.*\\btwo\\b)|(?P<_2>\\bfour\\b)|(?=.*(?P<_3>\\bfive\\b))(?=.*\\bsix\\b)(?=.*\\bseven\\b)'

b = pd.Series([
    'one two three four five six seven', 
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'except four, no matching strings',
    'no_one, three AND two, no_four, five AND seven AND six'])

match_lists = (np.apply_along_axis(
        lambda vec: vec[[regex_list.index(x) for x in regex_list]], 1, (
        (~b.str.extractall(regex_single).isna())
        .reset_index()
        .groupby('level_0').agg('sum')
        .drop(columns='match')
        .reindex(range(b.size), fill_value=False)
        .values > 0 )
    ).tolist())

match_lists

[[True, True, True, True],
 [False, False, False, False],
 [True, False, True, True],
 [False, False, True, False],
 [False, True, False, True]]
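
One caveat that seems worth checking with the alternation approach: at any position the first alternative that succeeds wins, so a lookahead-only alternative can hide a later lookahead-only alternative. For example, I would expect 'five six seven three two' to report the three/two group but miss the five/six/seven group, because the three/two lookaheads succeed at every position where the five/six/seven ones do. Below is a sketch of one way around that, under the assumption that each string is a single line: every group is tested at the start of the string in a single re.match call, so no group depends on another. The names anchored_pattern and group_flags are hypothetical helpers, not part of the code above.

```python
import re

regex_list = [['one'], ['three', 'two'], ['four'], ['five', 'six', 'seven']]

def anchored_pattern(groups):
    """Build one pattern that tests every group at the start of the string.

    Each group becomes '(?:<lookaheads>|)': the lookaheads are tried first and
    the empty alternative lets the match continue when they fail.
    The words are used as regex fragments, as in regex_single_make above.
    """
    parts = []
    for n, words in enumerate(groups):
        # Unnamed lookaheads for all but the first word ...
        checks = ''.join(r'(?=.*\b{}\b)'.format(w) for w in words[1:])
        # ... then the named capture last, so it is set only if the others succeeded.
        checks += r'(?=.*(?P<_{}>\b{}\b))'.format(n, words[0])
        parts.append('(?:{}|)'.format(checks))
    return re.compile(''.join(parts))

def group_flags(pattern, text):
    """Return one Boolean per capture group, in group order."""
    m = pattern.match(text)  # always matches; groups that were not found are None
    return [m.group(i) is not None for i in range(1, pattern.groups + 1)]

pattern = anchored_pattern(regex_list)
print(group_flags(pattern, 'five six seven three two'))  # [False, True, False, True]
```

This gives one list per string directly, so it could be applied with a list comprehension or Series.map instead of aggregating the rows returned by extractall.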

Upvotes: 0

Avinash Raj

Reputation: 174716

We can solve this with named capturing groups. I separated the pattern into two parts. The function checks whether the first and second parts are present, and returns True for each part that is found and False otherwise.

>>> import re
>>> def findstr(x):
    first = second = False
    matches = re.finditer(r'(?P<first>one)|(?=.*(?P<second>three))(?=.*two)', x)
    for match in matches:
        if match.group('first'):
            first = True
        elif match.group('second'):
            second = True
    return [first, second]

>>> str_lst = ['one two three', 'one three two', 'two three one', 'one three ten', 'ten three two']
>>> for stri in str_lst:
    print(findstr(stri))


[True, True]
[True, True]
[True, True]
[True, False]
[False, True]
>>>

Note that the second group gets captured only if both two and three exist in the string. Check the demo below for clarification.

DEMO

Upvotes: 1

Bee

Reputation: 1296

The following line uses re.finditer instead of re.findall. The regex also needs a .+ at the end so that it consumes the rest of the string when both two and three are present, regardless of their order.

[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three).+', 'one three two')]

This also works for one three two four, as mentioned in one of the OP's comments, without having to declare all possible permutations.

[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three)(?=.*four).+', 'one two four three')]
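
Because a match object is always truthy, the list appears to hold one True per match rather than one Boolean per group, so it may be worth checking the output against the five strings from the question. A quick sketch (the strings come from the question; the dictionary and loop are illustrative):

```python
import re

pattern = '(one)|(?=.*two)(?=.*three).+'
strings = ['one two three', 'one three two', 'two three one',
           'one three ten', 'ten three two']

# A match object is always truthy, so each list holds one True per match.
results = {s: [bool(m) for m in re.finditer(pattern, s)] for s in strings}
for s, flags in results.items():
    print(repr(s), flags)
# 'one two three' [True, True]
# 'one three two' [True, True]
# 'two three one' [True]
# 'one three ten' [True]
# 'ten three two' [True]
```

The last three strings all give [True], so a missing 'one' and a missing two/three pair cannot be told apart from the list alone. In 'two three one' the trailing .+ also consumes 'one' before the first alternative gets a chance to match it.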

Upvotes: 0

Emma

Reputation: 27723

My guess is that you wish to design some expression similar to:

[bool(x) for x in re.findall(r'^(?:one\b.*?)\b(two|three)\b|\b(three|two)\b.*$', 'one three two')]

I'm not sure, though. Or maybe:

search = ['two','three']
string_to_search = 'one two three'

# One Boolean per search word, in order (False when the word is missing)
output = [word in string_to_search for word in search]

print(output)

Output

[True, True]

Upvotes: 1
