jbdev

Reputation: 47

python regex: capturing group within OR

I'm using Python and the re module to parse some strings and extract a 4-digit code associated with a prefix. Here are two examples of strings I need to parse:

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

tokenA and tokenB are the prefixes, and 1234, 5678 and 0123 are the digits I need to grab. tokenA and tokenB are just examples here: the prefix can be something like an address, http://domain.com/ (tokenA), or a string like Id:, matched by '[Ii]d:?\s?' (tokenB).

My regex looks like:

re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)

When parsing the 2 strings above, I get:

[('1234','')]
[('','5678'),('0123','')]

I'd like to simply get ['1234'] or ['5678','0123'] instead of a list of tuples. How can I modify the regex to achieve that? Thanks in advance.

Upvotes: 3

Views: 157

Answers (2)

Wiktor Stribiżew

Reputation: 626804

You get tuples because your regex contains more than one capturing group. See the re.findall reference:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

So, the solution is to use only one capturing group.
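As a small sketch of that behaviour, here is the second string from the question run against a two-group pattern and a one-group pattern. The variable names are only illustrative.

```python
import re

s = "whatever here tokenB5678 tokenA0123 and more there"

# Two capturing groups: every match becomes a tuple, and the group
# that did not take part in the match is reported as ''.
two_groups = re.findall(r'tokenA([0-9]{4})|tokenB([0-9]{4})', s)
print(two_groups)  # [('', '5678'), ('0123', '')]

# One capturing group: every match becomes a plain string.
one_group = re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s)
print(one_group)  # ['5678', '0123']
```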

Only the prefix tokens differ, while the ([0-9]{4}) part is common to both. So put the tokens into a non-capturing group with an alternation operator between them, and follow it with the single capturing group:

(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^

The regex means:

  • (?:tokenA|tokenB) - match, but do not capture, tokenA or tokenB
  • ([0-9]{4}) - match four digits and capture them into Group 1

IDEONE demo:

import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s)) 

Result: ['1234', '3456']
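The same idea should carry over to the real prefixes mentioned in the question. The following is a sketch under assumptions: the two prefix patterns come from the question, but the sample text and the names prefixes, pattern and codes are made up for illustration. The literal URL is passed through re.escape so that its dot is matched literally.

```python
import re

# Prefix patterns from the question; the URL is a literal, so escape it.
prefixes = [re.escape("http://domain.com/"), r"[Ii]d:?\s?"]

# Join the prefixes inside one non-capturing group, then capture the digits.
pattern = re.compile(r"(?:" + "|".join(prefixes) + r")([0-9]{4})")

# Invented sample text, not taken from the question.
text = "see http://domain.com/1234 and Id: 5678, also id0123"
codes = pattern.findall(text)
print(codes)  # ['1234', '5678', '0123']
```

Building the alternation from a list keeps the pattern readable when more prefixes are added later.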

Upvotes: 1

Avinash Raj

Reputation: 174706

Simply do this:

re.findall(r"token[AB](\d{4})", s)

[AB] is a character class, so it matches either A or B. This shortcut only applies when the prefixes differ by a single character, as tokenA and tokenB do in the example.
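For reference, a quick check of this pattern against the two strings from the question (the names pattern, result1 and result2 are only illustrative):

```python
import re

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

# A single capturing group, so findall returns plain strings.
pattern = re.compile(r"token[AB](\d{4})")

result1 = pattern.findall(str1)
result2 = pattern.findall(str2)
print(result1)  # ['1234']
print(result2)  # ['5678', '0123']
```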

Upvotes: 1
