TatSu: How to optimize the following grammar logic for faster parse time?

Question

I have the following grammar in TatSu. To reduce parse time, I implemented cut operations (i.e., commit to a particular rule option once a particular token is seen).

However, parsing is still slow. A file with about 830K lines takes about 25 minutes (without cut expressions it was close to 40 minutes). I think further improvement is possible, but I am not sure how to rewrite the following logic in a better way.

Judging by the TatSu grammar matching traces, I believe the bulk of the time is spent in vec_data_string/vec_data_strings. Any suggestions on how to improve this further?

pattern_stmt
    =
    (('V'|'C')&'{' | 'Vector' | 'Condition') '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'
    | ('W' | 'WaveformTable') ~ identifier ';'
    | annotation
    | 'Loop' ~ integer '{' ~ [pattern_statements] '}'
    | 'MatchLoop' ~ ('Infinite' | integer) ~ '{' ~ pattern_statements 'BreakPoint' ~ '{' ~ pattern_statements '}' ~ '}'
    | ('GoTo' | 'ScanChain') ~ identifier ';'
    | 'BreakPoint' '{' ~ pattern_statements '}'
    | ('BreakPoint' | 'IddqTestPoint' | 'Stop') ~ ';'
    | 'TimeUnit' ~ "'" ~ number [siunit] "'" ~ ';'
    ;
vec_data_block
        =
        | signal_reference_expr '=' ~ vec_data_string ';'
        | signal_reference_expr '{' ~ {vec_data_strings} '}'
        ;
    vec_data_strings
        =
        {vec_data_string ';'}+
        ;
    vec_data_string
        =
        {wfc_data}+
        | {hex_data}+
        | {dec_data}+
        ;
    wfc_data
        =
        ['\r' ~ integer] wfcs
        | hex_mode
        | dec_mode
        ;
    hex_data
        =
        ['\r' ~ integer] hex
        | wfc_mode
        | dec_mode
        ;
    dec_data
        =
        ['\r' ~ integer] integer
        | wfc_mode
        | hex_mode
        ;
    hex_mode
        =
        '\h' ~ [wfcs] {hex_data}+
        ;
    wfc_mode
        =
        '\w' ~ {wfc_data}+
        ;
    dec_mode
        =
        '\d' ~ [wfcs] {dec_data}+
        ;
    wfcs
        =
        /[a-zA-Z0-9#%]+/
        ;
    hex
        =
        /[0-9a-fA-F]+/
        ;

    integer::int
        =
        /\d+/
        ;

My test file has a lot of sequences like this:

    "ALLPIs" = 001 \r27 Z 10001ZZ0 \r22 Z 0 \r22 Z 0 \r22 Z 0 \r20 Z 1111 \r133 Z 0Z0010;
    "ALLPOs" = \r243 X ;
    "ALLCIOs" = \r557 Z 0 \r10 Z 0ZZ0001001 \r19 Z ;

Apalala · Accepted Answer

You can considerably reduce the number of calls by factoring some common subexpressions.

For example:

    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'

can be written as:

    | ('Macro' | 'Call') identifier (';' | '{' ~ {vec_data_block} '}')
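
To illustrate why this kind of factoring reduces the number of calls, here is a minimal sketch in plain Python. It is not TatSu and not how TatSu is implemented: it is a toy backtracking matcher whose helper names (`token`, `keyword`, `identifier`, `count_identifier_calls`) are all hypothetical, the `{vec_data_block}` body is simplified to an empty `{ }`, and there is no memoization (TatSu memoizes rule results, so the real saving may be smaller). It only counts how often the `identifier` rule is attempted for the same input under both formulations.

```python
import re

# Call counter for the toy 'identifier' rule (hypothetical instrumentation).
calls = {"identifier": 0}

IDENT = re.compile(r"\s*[A-Za-z_]\w*")


def token(text, pos, literal):
    """Skip whitespace, then match a literal; return the new position or None."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if text.startswith(literal, pos):
        return pos + len(literal)
    return None


def keyword(text, pos):
    """Toy version of ('Macro' | 'Call')."""
    for kw in ("Macro", "Call"):
        end = token(text, pos, kw)
        if end is not None:
            return end
    return None


def identifier(text, pos):
    """Toy identifier rule; every attempt is counted."""
    calls["identifier"] += 1
    m = IDENT.match(text, pos)
    return m.end() if m else None


def unfactored(text):
    # Alternative 1: ('Macro' | 'Call') identifier '{' '}'
    pos = keyword(text, 0)
    if pos is not None:
        pos = identifier(text, pos)
        if pos is not None:
            pos = token(text, pos, "{")
            if pos is not None and token(text, pos, "}") is not None:
                return True
    # Alternative 2: ('Macro' | 'Call') identifier ';'
    # The shared prefix is matched again from the start.
    pos = keyword(text, 0)
    if pos is not None:
        pos = identifier(text, pos)
        if pos is not None and token(text, pos, ";") is not None:
            return True
    return False


def factored(text):
    # ('Macro' | 'Call') identifier (';' | '{' '}')
    pos = keyword(text, 0)
    if pos is None:
        return False
    pos = identifier(text, pos)
    if pos is None:
        return False
    if token(text, pos, ";") is not None:
        return True
    pos = token(text, pos, "{")
    return pos is not None and token(text, pos, "}") is not None


def count_identifier_calls(parser, text):
    """Run a toy parser and report (matched, number of identifier attempts)."""
    calls["identifier"] = 0
    matched = parser(text)
    return matched, calls["identifier"]
```

In this toy model, `count_identifier_calls(unfactored, "Call foo ;")` attempts `identifier` twice, because the first alternative fails only at `'{'` and the second starts over, while `count_identifier_calls(factored, "Call foo ;")` attempts it once. Multiplied over hundreds of thousands of statements, the repeated prefix work is what the factored rule avoids.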

Using TatSu's support for reserved words (the `@@keyword` directive together with the `@name` rule decorator) should also help.
