Ken Masters

Reputation: 281

How to find certain words using regular expressions in Python?

I am learning the ropes with regular expressions in Python. I have the code below:

import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z101', 'Z104']

However, when I change 'Z101' to 'YZ101':

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z102', 'Z104']

The purpose is to extract strings made up of X, Y or Z followed by any three digits. Therefore, the desired output for the first code sample would be:

['Z101', 'Z102', 'Z104']

How can I fix the pattern to get the correct output?

Upvotes: 2

Views: 657

Answers (3)

Wiktor Stribiżew

Reputation: 626691

The left- and right-hand boundary patterns ([\(\+] and [\)\+]) consume the text they match, so the + that ends one match is no longer available to start the next one, and consecutive items are skipped.
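
To see the consumption, here is a small sketch that prints the full text of each match rather than just the captured group. The variable names are mine, not from the question:

```python
import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
consuming = re.compile(r"[(+]([XYZ]\d\d\d)[)+]")

# group(0) is the whole match, delimiters included. The '+' after Z101
# belongs to the first match, so nothing is left to open a match for Z102.
full_matches = [m.group(0) for m in consuming.finditer(test)]
print(full_matches)  # ['(Z101+', '+Z104)']
```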

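You can solve the problem using lookarounds, which check the neighbouring characters without consuming them (shown with and without a capturing group):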

r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])"
r"(?<=[(+])[XYZ]\d{3}(?=[)+])"

Details

  • (?<=[(+]) - a positive lookbehind that matches a location that is immediately preceded by ( or +
  • [XYZ] - X, Y or Z
  • \d{3} - three digits
  • (?=[)+]) - a positive lookahead that makes sure there is ) or + immediately to the right of the current location.
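
A quick sketch of the second pattern run against both inputs from the question (the variable names are just for illustration). Note that it still requires ( or + right before the letter, so the Z101 inside YZ101 is not returned:

```python
import re

lookaround = re.compile(r"(?<=[(+])[XYZ]\d{3}(?=[)+])")

first = '"(Z101+Z102+Z1034+Z104)/4"'
second = '"(YZ101+Z102+Z1034+Z104)/4"'

# The lookarounds only inspect the delimiters, so a shared '+' can
# close one match and open the next.
first_result = lookaround.findall(first)
second_result = lookaround.findall(second)

print(first_result)   # ['Z101', 'Z102', 'Z104']
print(second_result)  # ['Z102', 'Z104']
```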

Note that a word boundary, \b, can solve the issue in some situations, and it might help here, too.
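
As a rough illustration of the word-boundary idea, assuming you want \b on both sides (that choice is mine, adjust it to your data):

```python
import re

bounded = re.compile(r"\b[XYZ]\d{3}\b")

first = '"(Z101+Z102+Z1034+Z104)/4"'
second = '"(YZ101+Z102+Z1034+Z104)/4"'

# \b is zero-width, so nothing is consumed. Z1034 fails the trailing \b
# (3 and 4 are both word characters), and the Z in YZ101 fails the
# leading \b (Y and Z are both word characters).
bounded_first = bounded.findall(first)
bounded_second = bounded.findall(second)

print(bounded_first)   # ['Z101', 'Z102', 'Z104']
print(bounded_second)  # ['Z102', 'Z104']
```

Dropping the leading \b would let the Z101 inside YZ101 match as well, if that is what you need.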

Upvotes: 3

Kyle Alm

Reputation: 587

Your pattern is looking for:

  1. Either '(' or '+'
  2. Exactly one of 'X', 'Y', or 'Z'
  3. Exactly three numeric characters
  4. Either ')' or '+'

It's not selecting 'Z101' because when you add 'Y', that substring isn't immediately preceded by '(' or '+'.

One option would be to leave 1 and 4 out of the pattern. That pattern would be r'[XYZ]\d\d\d'. In this example it finds every item you want, but without the delimiters it also picks up 'Z103' from 'Z1034', so depending on your data it might create a different problem down the road.

Another option would be to allow an optional prefix character with '?'. As a quantifier, '?' means 'zero or one' (it can also modify other quantifiers, but that's a different topic). To do that, your pattern would be r"[(+][XYZ]?([XYZ]\d\d\d)[)+]"
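
Here is a small sketch that tries both options on the second input from the question (the variable names are only for illustration). The optional-prefix pattern still consumes its delimiters, so in this run it recovers 'Z101' but skips 'Z102':

```python
import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'

# Option 1: no delimiters at all. Also matches the start of Z1034.
no_delims = re.findall(r"[XYZ]\d\d\d", test)

# Option 2: optional prefix letter, delimiters still part of the match.
optional_prefix = re.findall(r"[(+][XYZ]?([XYZ]\d\d\d)[)+]", test)

print(no_delims)        # ['Z101', 'Z102', 'Z103', 'Z104']
print(optional_prefix)  # ['Z101', 'Z104']
```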

Upvotes: 1

Tim Biegeleisen

Reputation: 520878

Use re.findall with the pattern [XYZ]\d{3}\b:

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
matches = re.findall(r'[XYZ]\d{3}\b', test)
print(matches)  # ['Z101', 'Z102', 'Z104']

Upvotes: 2
