K. Swan

Reputation: 195

Searching List of Strings Using Regex to Find Substrings Python

I have gone through many of the regex questions on here and used the advice in them, but I still can't get my code to work. I have a list of strings, and I am trying to find the entries in this list that contain one of the following patterns:

For example, I should be able to find sentences that contain phrases like "an idiot of a doctor" or "the hard-worker of a student."

Once found, I want to make a list of the sentences that satisfy these criteria. So far, this is my code:

for sentence in sentences:
    matched = re.search(r"a [.*]of a " \
                        r"an [.*]of an " \
                        r"a [.*]of an" \
                        r"an [.*]of a " \
                        r"that [.*]of a " \
                        r"that [.*]of an " \
                        r"the [.*]of a " \
                        r"the [.*]of an ", sentence)
    if matched:
        bnp.append(matched)

#Below two lines for testing purposes only
print(matched)
print(bnp)

This code turns up no results, despite the fact that there are phrases that should satisfy the criteria in the list.

Upvotes: 1

Views: 1479

Answers (2)

Iron Fist

Reputation: 10951

[.*] is a character class, so you are asking the regex to match a single literal dot or star character. Quoting from the re docs:

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

...
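To make the difference concrete, here is a minimal sketch comparing the character class with the wildcard form. The test strings and variable names are my own, chosen only for illustration:

```python
import re

# Inside [...], '.' and '*' are ordinary characters, so [.*] matches
# exactly one literal dot or star.
char_class = re.compile(r"a [.*]of a ")
class_on_phrase = bool(char_class.search("an idiot of a doctor"))
class_on_literal = bool(char_class.search("a .of a doctor"))

# Outside brackets, .* means "any run of characters".
wildcard = re.compile(r"an .* of a ")
wildcard_on_phrase = bool(wildcard.search("an idiot of a doctor"))

print(class_on_phrase, class_on_literal, wildcard_on_phrase)  # False True True
```

The character class only matches when a literal "." or "*" sits right before "of", which is why the original code finds nothing in ordinary sentences.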

So, here is one way to do it:

(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*

This expression first matches the, that, a, or an, then any characters up to a later a or an, and then the rest of the line.

Here in this link, there is a demonstration of its process.

And here is the actual demonstration:

>>> import re
>>>
>>> regex = r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*"
>>> test_str = ("an idiot of a doctor\n"
    "the hard-worker of a student.\n"
    "an BLANK of an BLANK\n"
    "a BLANK of an BLANK\n"
    "an BLANK of a BLANK\n"
    "that BLANK of a BLANK\n"
    "the BLANK of a BLANK\n"
    "the BLANK of an BLANK\n")
>>>
>>> matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
>>> 
>>> for m in matches:
        print(m.group())


an idiot of a doctor
the hard-worker of a student.
an BLANK of an BLANK
a BLANK of an BLANK
an BLANK of a BLANK
that BLANK of a BLANK
the BLANK of a BLANK
the BLANK of an BLANK

Upvotes: 1

Chris

Reputation: 298

As it stands, Python concatenates your adjacent string literals into one long pattern with no operators between them. So in effect you are searching for the regex "a [.*]of a an [.*]of an a [.*]of an ..."
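A short sketch of that concatenation, and of one way to join alternatives with | instead. The two-item alternatives list and the variable names are assumptions for illustration, not the asker's full set of patterns:

```python
import re

# Adjacent string literals are joined into one string, with nothing
# inserted between them.
joined = (r"a [.*]of a "
          r"an [.*]of an ")
print(repr(joined))  # 'a [.*]of a an [.*]of an '

# Joining the alternatives with '|' lets the regex pick any one of them.
# (These use .* rather than [.*] so that they can match ordinary words.)
alternatives = [r"the .* of a ", r"an .* of a "]
pattern = re.compile("|".join(alternatives))

found = pattern.search("an idiot of a doctor")
print(found.group())  # an idiot of a
```

Building the pattern with "|".join(...) keeps each alternative on its own line while still producing a single valid regex.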

You are missing the 'or' operator: |. A simpler regex to accomplish this task would be something like:

(a|an|that|the) \b.*\b of (a|an) \b.*\b
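Applied to the original loop, this could look roughly like the sketch below. The sample sentences are made up, since the real sentences list is not shown, and the leading \b and the re.IGNORECASE flag are my additions rather than part of the pattern above:

```python
import re

# Hypothetical sample data; the asker's actual list is not shown.
sentences = [
    "She is an idiot of a doctor.",
    "He was the hard-worker of a student.",
    "A cup of tea would be nice.",
]

# The leading \b (an addition) stops "a" from matching the tail of a
# longer word such as "banana".
pattern = re.compile(r"\b(a|an|that|the) \b.*\b of (a|an) \b.*\b",
                     re.IGNORECASE)

# Collect the sentences themselves rather than the match objects.
bnp = [sentence for sentence in sentences if pattern.search(sentence)]
print(bnp)
```

Note that the original loop appended the match object to bnp; appending the sentence instead gives the list of sentences the question asks for.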

Upvotes: 1
