porgrammer3124

Reputation: 85

regex match word and what comes after it

I need some help with a regex I am writing. I have a list of keywords that I want to match, together with any words that come after them (words meaning [A-Za-z/\s]+), i.e. no parentheses, symbols, or numbers.

words = ['qtr','hard','quarter'] # keywords that must exist

test=['id:12345 cli hard/qtr Mix',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']

Expected output is

['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]

What I have tried so far

re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)

Upvotes: 2

Views: 228

Answers (3)

imxitiz

Reputation: 3987

Idea taken from the existing answer and made shorter:

>>> pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]

As mentioned in a comment:

But this is inefficient, because it runs the same search twice for every string: once for if pattern.search(x) and once more for pattern.search(x).group(0). A list comprehension written this way is not the best fit for such cases.

We can try this to overcome that issue:

>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
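
Another way to search only once, as a rough sketch: an assignment expression (the := operator) can bind the match inside the comprehension. This assumes Python 3.8 or newer, and the non-capturing group (?:...) is my own simplification since only group 0 is used here.

```python
import re

words = ['qtr', 'hard', 'quarter']
test = ['id:12345 cli hard/qtr Mix',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work',
        'Hello world']

# Keyword alternation followed by letters, slashes and spaces
pattern = re.compile('(?:' + '|'.join(words) + ')[a-zA-Z/ ]*')

# (m := ...) stores the match object, so each string is searched only once
result = [m.group(0) if (m := pattern.search(x)) else None for x in test]
print(result)
```

Whether this reads better than the nested generator is a matter of taste; both avoid the second search.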

Upvotes: 1

ThePyGuy

Reputation: 18466

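The problem with the pattern you have, i.e. '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the doubled square brackets. [[A-Za-z/\s] is read as a single character class that also contains a literal [, and the ] left over at the end is then a literal ] that must appear in the text. None of your strings has a ] after the keyword, so nothing matches. (\s inside a character class still matches whitespace.) Use a single pair of brackets instead; if you only want to allow spaces rather than any whitespace, you can put a plain space character in the class instead of \s.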

You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for that pattern in each string in the list. If a match is found, take group 0 and append it to the result list; otherwise, append None:

pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)      

OUTPUT:

result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]

And since you are including the space character, you may end up with some values that have a space at the end; you can just strip off the whitespace later.
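
The stripping step could look something like this. It is only a sketch: the words and test lists are copied from the question, and the non-capturing group (?:...) is my own simplification because only group 0 is needed.

```python
import re

words = ['qtr', 'hard', 'quarter']
test = ['id:12345 cli hard/qtr Mix',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work',
        'Hello world']

pattern = re.compile('(?:' + '|'.join(words) + ')[a-zA-Z/ ]*')

result = []
for x in test:
    match = pattern.search(x)
    # strip() removes the space that was matched just before '90%' or '(red)'
    result.append(match.group(0).strip() if match else None)

print(result)
# ['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
```

With the strip applied, the list is the same as the expected output in the question.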

Upvotes: 1

vszholobov

Reputation: 2363

You can put all the needed words in an or expression (alternation) and put your word definition after it:

import re

words = ['qtr','hard','quarter']

regex = r"(" + "|".join(words) + r")[A-Za-z/\s]+"

p = re.compile(regex)
test=['id:12345 cli hard/qtr Mix(qtr',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']

for string in test:
    result = p.search(string)
    if result is not None:
        print(result.group(0))
    else:
        print(result)

Output:

hard/qtr Mix
qtr 
hard 
hard work
None

Upvotes: 0
