Midnighter

Reputation: 3881

How to efficiently parse a word that includes the majority of unicode characters?

I'm on Python 3.7 and pyparsing==2.4.2

I essentially want to parse the following but in an efficient manner:

import pyparsing as pp


content = pp.OneOrMore(
    pp.Word(pp.pyparsing_unicode.printables, excludeChars="#<>;")
)

The above is about 100 times slower than

content = pp.OneOrMore(
    pp.Word(pp.printables, excludeChars="#<>;")
)

Using pp.CharsNotIn is reasonably fast again, but it behaves somewhat differently from pp.Word: if I include whitespace in the unmatched characters (so that I get separate tokens), it does not combine nicely with pp.OneOrMore.

content = pp.OneOrMore(
    pp.CharsNotIn(" \t\r\n#<>;")
)

leads to ParseException when parsing, for example,

parser.content.parseString("foo bar", parseAll=True)
pyparsing.ParseException: Expected end of text, found 'b'  (at char 4), (line:1, col:5)

Is there any good strategy for this scenario?

Upvotes: 3

Views: 267

Answers (1)

PaulMcG

Reputation: 63739

I wanted to be sure that your performance testing kept separate the time to create the expression and the time to use it for parsing. I also tried out two other Regex formats, described below:

Create Word expression 6.56244158744812
Create Regex expression 0.0
Create Regex2 expression 3.991360902786255
Create Regex3 expression 0.4946744441986084

Parsing using Word expression
3.837733268737793
['foo', 'bar', '中文']
Parsing using Regex expression "[^ <>#;]+" 
0.07877945899963379
['foo', 'bar', '中文']
Parsing using Regex2 expression "[pp.pyparsing_unicode.printables]+"
3.8447225093841553
['foo', 'bar', '中文']
Parsing using Regex3 expression "[pp.pyparsing_unicode.printables converted to ranges]+"
0.07676076889038086
['foo', 'bar', '中文']

You can see that both parse the test string correctly, but the Regex is about 50X faster (roughly 3.84s vs 0.08s above). I also tested using a Regex created from "[" + pp.pyparsing_unicode.printables + "]+", and this ended up being about the same as the Word expression.
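For reference, a minimal sketch of the fast negated-class approach (I've added tab and newline characters to the excluded set alongside the question's excludeChars, so tokens still split on whitespace):

```python
import pyparsing as pp

# A negated character class implicitly admits the full Unicode range,
# so there is no need to enumerate millions of printable characters.
content = pp.OneOrMore(pp.Regex(r"[^ \t\r\n#<>;]+"))

result = content.parseString("foo bar 中文", parseAll=True)
print(result.asList())
```

Unlike pp.CharsNotIn, pp.Regex skips leading whitespace by default, so it composes with pp.OneOrMore the same way pp.Word does.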

Finally I tested using a Regex created by converting pp.pyparsing_unicode.printables to actual re ranges, instead of just one big million-character re range (like converting a regex of alphanums from "[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789]+" to "[A-Za-z0-9]+").

This ends up being comparable to the negated range match, leading me to believe that converting character lists to re ranges is a potential speedup for parsing Words in general (with a small penalty at parser create time).
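The range conversion can be sketched like this (the helper name to_ranges is my own; the answer above does not show its implementation):

```python
import re
import string

def to_ranges(chars):
    """Collapse a list of characters into a compact regex character class.

    Runs of consecutive code points are folded into ranges, e.g. the full
    alphanumeric list becomes "[0-9A-Za-z]+" instead of 62 separate chars.
    """
    codes = sorted(set(ord(c) for c in chars))
    groups = []
    start = prev = codes[0]
    for code in codes[1:]:
        if code == prev + 1:
            prev = code
            continue
        groups.append((start, prev))
        start = prev = code
    groups.append((start, prev))

    parts = []
    for lo, hi in groups:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return "[" + "".join(parts) + "]+"

print(to_ranges(string.ascii_letters + string.digits))
```

Applied to pp.pyparsing_unicode.printables, the resulting class contains a few hundred ranges rather than a single million-character alternation, which is what makes the compiled re so much faster to match.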

Upvotes: 2
