Reputation: 47
I'm using Python and the re module to parse some strings and extract a 4-digit code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes, and 1234, 5678, 0123 are the digits I need to grab. tokenA and tokenB are just examples here. The prefix can be something like an address, http://domain.com/ (tokenA), or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a list of tuples.
How can I modify the regex to achieve that? Thanks in advance.
Upvotes: 3
Views: 157
Reputation: 626804
You get tuples as a result since you have more than 1 capturing group in your regex. See the re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
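To make the quoted rule concrete, here is a small sketch of how the shape of the re.findall result changes with the number of capturing groups. It reuses str2 from the question; the variable names are only illustrative.

```python
import re

s = "whatever here tokenB5678 tokenA0123 and more there"

# Two capturing groups: every match becomes a tuple, with '' for the
# group that did not take part in that match.
two_groups = re.findall(r'tokenA([0-9]{4})|tokenB([0-9]{4})', s)
print(two_groups)  # [('', '5678'), ('0123', '')]

# Exactly one capturing group: a flat list of that group's text.
one_group = re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s)
print(one_group)   # ['5678', '0123']

# No capturing groups: a flat list of the whole matches.
no_groups = re.findall(r'(?:tokenA|tokenB)[0-9]{4}', s)
print(no_groups)   # ['tokenB5678', 'tokenA0123']
```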
Since you have tokens in your regex, you can use them inside a group. Since only the tokens differ and the ([0-9]{4}) part is common to both, just use an alternation operator between the tokens, placed inside a non-capturing group:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but do not capture tokenA or tokenB
([0-9]{4}) - match and capture four digits into Group 1
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
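The same idea should carry over to the more realistic prefixes mentioned in the question. The following is only a sketch: the sample text is made up, and it assumes the URL prefix is meant literally (hence re.escape) while the Id prefix is already a regex fragment.

```python
import re

# Prefixes taken from the question's description; the sample text is invented.
url_prefix = re.escape("http://domain.com/")  # literal text, so escape the dot
id_prefix = r"[Ii]d:?\s?"                      # already a regex, used as-is

# One non-capturing group for the prefixes, one capturing group for the digits.
pattern = re.compile(rf"(?:{url_prefix}|{id_prefix})([0-9]{{4}})")

text = "see http://domain.com/1234 and Id: 5678 or id0123"
codes = pattern.findall(text)
print(codes)  # ['1234', '5678', '0123']
```

Note that, as in the tokenB34567 example above, the pattern captures the first four digits even when more digits follow.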
Upvotes: 1
Reputation: 174706
Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put A and B inside a character class, [AB], so that it matches either A or B. This only works when the prefixes differ by a single character.
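As a quick check, assuming the same sample string as in the other answer, the character class version and the alternation version return the same list:

```python
import re

s = "tokenA1234tokenB34567"

# Character class: one character, either A or B, after "token".
char_class = re.findall(r"token[AB](\d{4})", s)

# Alternation inside a non-capturing group, for comparison.
alternation = re.findall(r"(?:tokenA|tokenB)(\d{4})", s)

print(char_class)                 # ['1234', '3456']
print(char_class == alternation)  # True
```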
Upvotes: 1