Processing whitespace or comma delimited list of tokens with arpeggio

Question

I'm trying to write an Arpeggio grammar that extracts tokens delimited by either commas or whitespace. That is, tokens can be separated by commas, like this:

a,b,c

by whitespace, like this:

a b  c

or a combination, like this:

a, b c

All of the above would produce the three tokens "a", "b", and "c". I also want to allow empty tokens, so that two commas with nothing but whitespace between them would produce an empty token:

"a,b,, c" -> ["a", "b", "", "c"]
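The behaviour described above can be pinned down without Arpeggio as a plain regular-expression split. This is only a reference sketch of the expected output, not a grammar fix; `expected_tokens` and `_SEP` are hypothetical names introduced for illustration, and it assumes leading and trailing whitespace should be ignored.

```python
import re

# A separator is either a comma with optional surrounding whitespace,
# or a run of whitespace on its own. The comma alternative is tried first
# so that " , " counts as one separator rather than several.
_SEP = re.compile(r'\s*,\s*|\s+')

def expected_tokens(string):
    """Return the tokens the grammar is supposed to produce (reference only)."""
    # Assumption: surrounding whitespace is not significant.
    return _SEP.split(string.strip())

print(expected_tokens('a,b,c'))    # ['a', 'b', 'c']
print(expected_tokens('a b  c'))   # ['a', 'b', 'c']
print(expected_tokens('a, b c'))   # ['a', 'b', 'c']
print(expected_tokens('a,b,, c'))  # ['a', 'b', '', 'c']
```

Note that an empty token only arises between two commas; a run of whitespace by itself never produces one.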

I've defined my arpeggio grammar like this:

from arpeggio import ParserPython, RegExMatch, ZeroOrMore, OneOrMore, EOF

def token(): return RegExMatch(r'[^\s,]*')
def sep(): return RegExMatch(r'\s*[\s,]\s*')
def token_list(): return token, ZeroOrMore(sep, token)
def tokens(): return OneOrMore(token_list), EOF
parser = ParserPython(tokens)

and implemented a very simple visitor like this:

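from arpeggio import PTNodeVisitor, visit_parse_tree
from toolz import take_nth  # assumed source of take_nth (not stated in the question)

class TokenVisitor(PTNodeVisitor):
    def visit_token_list(self, node, children):
        return list(take_nth(2, children))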

and a top level function like this:

def tokenize(string):
    tree = parser.parse(string)
    return visit_parse_tree(tree, TokenVisitor())

This all works fine on these examples:

tokenize('a,b,c') # [u'a', u'b', u'c']
tokenize('a, b ,c') # [u'a', u'b', u'c']

However, the following examples give me strange output:

tokenize('a,b c') # u'a | , | b | c | '
tokenize('a,b c') # u'a | b | c | '
tokenize('a,b,,c') # [u'a', u'b', u',']

There may be something about how arpeggio deals with whitespace and empty strings that I don't understand. How can I fix my grammar to parse all of these examples correctly?
