Reputation: 5425
I have a string like this:
<hello<world<1 \< 2>
which represents a list of three strings: "hello", "world", and "1 < 2". I want my regular expression to match ("hello", "world", "1 \< 2"). (I will remove the backslashes later, during evaluation.) I'm using the following regular expression to match the text:
r"(?:<((?:[^<>]|\\.)*))+>"
The way I understand it, it matches at least one occurrence of (< followed by any number of non-<> characters or \anything) and then a closing >, but the results do not suggest that. Using re.match(..., ...).groups(), I get the following:
>>> import re
>>> re.match(r"(?:<((?:[^<>]|\\.)*))+>", r"<hello<world<1 \< 2>").groups()
<<< (' 2',)
>>> re.match(r"(?:<((?:[^<>]|\\.)*))+>", r"<hello<world<1 \< 2>").group(0)
<<< '<hello<world<1 \\< 2>'
What's confusing is that group(0) isn't even in groups(), and it appears that the rest of the substrings aren't in group(...). Is something wrong with my regular expression or approach, and how should I fix it?
To be clear, I'm building a lexer for a golfing language using regex, so replacing it with something like a char-by-char lexer would be inconvenient since I already have the regular expression lexer and most of the expressions set up. I'm wondering if a pure regex solution is possible.
Upvotes: 0
Views: 212
Reputation: 71451
You can try this:
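import re
s = r"<hello<world<1 \< 2>"  # raw string, so "\<" is not treated as an (invalid) escape sequence
# Split on ">" and on any "<" that is not followed by whitespace and a digit; drop empty pieces.
l = [i for i in re.split(r"<(?!\s\d)|>", s) if i]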
Output:
['hello', 'world', '1 \\< 2']
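Note that the lookahead above is tailored to this particular input (an escaped < followed by a space and a digit). As for the original pattern, two things seem to be going on: a repeated capture group only keeps its last iteration, and [^<>] also matches the backslash, so the \< is never consumed as an escape. Below is a minimal sketch of a more general pure-regex approach, assuming an item is a < followed by escaped characters or characters other than <, > and \. The names ITEM, items and unescaped are mine, not from the question.

```python
import re

s = r"<hello<world<1 \< 2>"

# Assumed item shape: "<" followed by any run of escaped characters (\x)
# or characters other than "<", ">" and "\".
ITEM = re.compile(r"<((?:\\.|[^<>\\])*)")

# A repeated group only remembers its last iteration, so collect every
# item with findall() instead of relying on groups().
items = ITEM.findall(s)
print(items)  # ['hello', 'world', '1 \\< 2']

# Removing the backslashes afterwards, as described in the question.
unescaped = [re.sub(r"\\(.)", r"\1", item) for item in items]
print(unescaped)  # ['hello', 'world', '1 < 2']
```

Excluding the backslash from the character class is what forces \< to be matched by the \\. branch, so the escaped < no longer starts a new item.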
Upvotes: 1