hyperneutrino

Reputation: 5425

Python regular expression not returning all groups

I have a string like this:

<hello<world<1 \< 2>

This represents a list of three strings: "hello", "world", and "1 < 2". I want my regular expression to match ("hello", "world", "1 \< 2"). (I will remove the backslashes later, during evaluation.) I'm using the following regular expression to match the text:

r"(?:<((?:[^<>]|\\.)*))+>"

As I understand it, the pattern matches one or more items, each consisting of a < followed by any number of characters that are either not < or >, or a backslash followed by any character, and then a closing >. The results do not reflect that. Using re.match(..., ...).groups(), I get the following:

>>> import re
>>> re.match(r"(?:<((?:[^<>]|\\.)*))+>", r"<hello<world<1 \< 2>").groups()
<<< (' 2',)
>>> re.match(r"(?:<((?:[^<>]|\\.)*))+>", r"<hello<world<1 \< 2>").group(0)
<<< '<hello<world<1 \\< 2>'

What's confusing is that group(0) isn't in groups(), and the other substrings don't appear in any group(...) either. Is something wrong with my regular expression or my approach, and how should I fix it?

To be clear, I'm building a lexer for a golfing language using regex, so replacing it with something like a char-by-char lexer would be inconvenient since I already have the regular expression lexer and most of the expressions set up. I'm wondering if a pure regex solution is possible.

Upvotes: 0

Views: 212

Answers (1)

Ajax1234

Reputation: 71451

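You can try splitting on the delimiters instead. Note that the lookahead (?!\s\d) only skips a < that is followed by whitespace and a digit, so it fits this sample input rather than escaped < characters in general:

import re

s = r"<hello<world<1 \< 2>"  # raw string keeps the backslash literal
# Split on ">" and on any "<" not followed by whitespace + digit; drop empty pieces.
l = [i for i in re.split(r"<(?!\s\d)|>", s) if i]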

Output:

['hello', 'world', '1 \\< 2']
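
For a pure-regex route that handles any backslash escape, here is a minimal sketch. It rests on two observations. First, a repeated capturing group in Python's re keeps only its last repetition, which is why groups() holds a single item. Second, in the original pattern [^<>] also matches the backslash, so the escaped < starts a new item and the last capture ends up as ' 2'. The names ITEM_RE, LIST_RE and parse_list are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical names for illustration; not part of the original lexer.
# One item: "<" followed by escaped pairs (backslash + any char) or
# ordinary characters. Excluding the backslash from the character class
# forces "\<" to be consumed as a pair instead of ending the item.
ITEM_RE = re.compile(r"<((?:\\.|[^<>\\])*)")
# Whole list: one or more items followed by a closing ">".
LIST_RE = re.compile(r"(?:<(?:\\.|[^<>\\])*)+>")


def parse_list(text):
    """Return the items of a list literal at the start of text, or None."""
    m = LIST_RE.match(text)
    if m is None:
        return None
    # findall yields one capture per item, unlike a repeated group.
    return ITEM_RE.findall(m.group(0))


print(parse_list(r"<hello<world<1 \< 2>"))  # ['hello', 'world', '1 \\< 2']
```

In a regex-based lexer, LIST_RE could serve as the token pattern, with ITEM_RE applied to the matched text afterwards to extract the individual strings.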

Upvotes: 1
