python search different strings on same line

Question

I have the following code that I'd like to optimize:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

But the results I get are not reliable. I'm using re.search here because it seems to be the only way I can match the exact regex patterns specified in stringA and stringB.

The logic behind this code is modeled after this egrep command example:

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there's a better way to do this without re.search, please let me know.

PM 2Ring · Accepted Answer

One way to do this is to make a pattern that matches either word (using \b so we only match complete words), use re.findall to check the string for all matches, and then use set equality to ensure that both words have been matched.

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)

output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False

A slightly more efficient way to do set(found) == words is to use words.issubset(found), since it skips the explicit conversion of found to a set.
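As a quick check that the two tests agree, here is a minimal sketch reusing the spam/egg words from above. The sample lines are just illustrations. The tests are interchangeable here only because findall can return nothing except members of words.

```python
import re

words = {"spam", "egg"}
pat = re.compile(r"\bspam\b|\begg\b")

both = pat.findall("this string has both egg and spam in it")
only_egg = pat.findall("this string has egg in it and another egg too")

# issubset accepts any iterable, so the list from findall can be passed directly
print(set(both) == words, words.issubset(both))
print(set(only_egg) == words, words.issubset(only_egg))
```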

As Jon Clements mentions in a comment, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape, just in case any of the words contain regex metacharacters.

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
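Putting the generalized pattern and the issubset test together might look something like the sketch below. The helper names make_word_pattern and contains_all_words are made up for this example, and it assumes every word starts and ends with a word character, since \b would not behave as intended otherwise.

```python
import re

def make_word_pattern(words):
    """Compile a pattern that matches any one of the given whole words."""
    # re.escape keeps metacharacters such as "." literal
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b({})\b".format(alternatives))

def contains_all_words(words, line):
    """Return True if every word occurs in line as a complete word."""
    wanted = set(words)
    found = make_word_pattern(wanted).findall(line)
    return wanted.issubset(found)

print(contains_all_words(["spam", "egg", "ham"], "ham and spam with egg"))
print(contains_all_words(["spam", "egg"], "spam and eggs"))
```

For code that checks many lines against the same words, it would probably be better to compile the pattern once outside the loop rather than on every call.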

Here's a version that matches the words in the specified order. If it finds a match it prints the matching substring, otherwise it prints None.

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

output

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None

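For comparison with the question: the chained egrep commands keep a line when both patterns match anywhere on it, in either order, whereas stringA + '.*' + stringB only matches when stringA comes first. That ordering is one likely reason the combined pattern seemed unreliable. A minimal sketch of the order-independent check, treating the strings as regexes the way egrep does (the helper name matches_all and the sample lines are made up for illustration):

```python
import re

def matches_all(patterns, line):
    """Return True if every regex in patterns matches somewhere in line."""
    return all(re.search(p, line) for p in patterns)

patterns = ["Success", "mysqlDB01"]

# Hypothetical log lines, not taken from the question
lines = [
    "Success: backup of mysqlDB01 finished",
    "mysqlDB01 backup: Success",
    "Success: backup of mysqlDB02 finished",
]

for line in lines:
    combined = bool(re.search("Success.*mysqlDB01", line))
    print(repr(line), matches_all(patterns, line), combined)
```

The second line passes matches_all but not the combined pattern, because mysqlDB01 appears before Success.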