user1054158

REGEX - String version of the parsing of a pattern done by `sre_parse`

The following code helps me parse patterns intended for the standard module re.

import sre_parse

pattern = r"(?P<TEST>test)\s+\w*(?P=TEST)|abcde"

parsedpattern = sre_parse.parse(pattern)
parsedpattern.dump()

Run in a terminal, this prints text that is easy to parse:

branch 
  subpattern 1 
    literal 116 
    literal 101 
    literal 115 
    literal 116 
  max_repeat 1 2147483647 
    in 
      category category_space
  max_repeat 0 2147483647 
    in 
      category category_word
  groupref 1 
or
  literal 97 
  literal 98 
  literal 99 
  literal 100 
  literal 101 

Is there an easy way to get this text as a string variable? I could reuse the code of the dump method, which I can obtain by applying inspect.getsourcelines to sre_parse.SubPattern, but I'm hoping for a more direct solution if there is one.

PS: I have not found any readable documentation for the module sre_parse. Do you know of any?

Upvotes: 0

Views: 1080

Answers (1)

senshin

Reputation: 10360

You could always temporarily replace sys.stdout with an object that collects whatever is printed, so the output of dump() ends up in a variable:

import sre_parse
import sys

class PseudoStdout:
    """Minimal file-like object that accumulates everything written to it."""
    def __init__(self):
        self.contents = ''
    def __enter__(self): # this and __exit__ are for context management
        self.old_stdout = sys.stdout
        sys.stdout = self
        return self
    def __exit__(self, type_, value, traceback):
        sys.stdout = self.old_stdout # restore the real stdout, even on error
    def write(self, text): # print() calls write() on sys.stdout
        self.contents += text

pattern = r"(?P<TEST>test)\s+\w*(?P=TEST)|abcde"
parsedpattern = sre_parse.parse(pattern)

ps = PseudoStdout()
with ps:
    parsedpattern.dump()

print(repr(ps.contents))

Result:

'branch \n  subpattern 1 \n    literal 116 \n    literal 101 \n    literal 115 \n    literal 116 \n  max_repeat 1 65535 \n    in \n      category category_space\n  max_repeat 0 65535 \n    in \n      category category_word\n  groupref 1 \nor\n  literal 97 \n  literal 98 \n  literal 99 \n  literal 100 \n  literal 101 \n'
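On Python 3.4 and later, the same redirection can probably be done without a custom class, using contextlib.redirect_stdout and io.StringIO from the standard library. This is only a sketch: the names buffer and dumped are mine, and the fallback import assumes that newer Python versions (3.11+) keep the parser in the private module re._parser, with sre_parse deprecated there.

```python
import contextlib
import io

try:
    # Assumption: on Python 3.11+ the parser lives in the private module
    # re._parser, and importing sre_parse emits a DeprecationWarning.
    from re import _parser as sre_parse
except ImportError:
    import sre_parse

pattern = r"(?P<TEST>test)\s+\w*(?P=TEST)|abcde"
parsedpattern = sre_parse.parse(pattern)

buffer = io.StringIO()  # in-memory text stream that stands in for stdout
with contextlib.redirect_stdout(buffer):
    parsedpattern.dump()  # dump() uses print(), so its output goes to buffer

dumped = buffer.getvalue()  # the dump as an ordinary str
print(repr(dumped))
```

The exact text may differ between Python versions (for example in letter case or in the repeat limit), so it is safer not to depend on it byte for byte.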

It seems more straightforward, though, to just step through parsedpattern itself, which is already structured.
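As a rough sketch of what stepping through the parsed pattern could look like: the object behaves like a sequence of (opcode, argument) pairs, and for a top-level alternation the argument holds the list of alternatives. This relies on undocumented internals, so the layout is an assumption that may change between versions; the names top, alternatives and summary are mine.

```python
try:
    # Assumption: on Python 3.11+ the parser lives in the private module
    # re._parser, and importing sre_parse emits a DeprecationWarning.
    from re import _parser as sre_parse
except ImportError:
    import sre_parse

pattern = r"(?P<TEST>test)\s+\w*(?P=TEST)|abcde"
parsedpattern = sre_parse.parse(pattern)

top = list(parsedpattern)  # list of (opcode, argument) tuples
op, av = top[0]            # this pattern has a single top-level branch node
_, alternatives = av       # branch argument: (None, [alternative, ...])

# Collect the opcode names of each alternative, one list per alternative.
summary = [[str(item_op) for item_op, _ in alt] for alt in alternatives]
for names in summary:
    print(names)
```

Nested nodes such as subpatterns and repeats carry their own sub-sequences in the argument, so a full walk would need to recurse into them.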

Upvotes: 3
