user4979733
user4979733

Reputation: 3411

TatSu: How to optimize the following grammar logic for faster parse time?

I have the following grammar in TatSu. To reduce parse time, I implemented cut operations (i.e., commit to a particular rule option once a particular token is seen).

However, I still see long runtimes. On a file with about 830K lines, it takes about 25min (without cut expressions it was close to 40min). I think further improvement is possible but I am not sure how to rewrite the following logic in a bettery way.

The main issue that I believe is taking the bulk of the time (by observing the TatSu grammar matching traces) is the vec_data_string/vec_data_strings. Any suggestions on how to improve this further?

pattern_stmt
    =
    (('V'|'C')&'{' | 'Vector' | 'Condition') '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'
    | ('W' | 'WaveformTable') ~ identifier ';'
    | annotation
    | 'Loop' ~ integer '{' ~ [pattern_statements] '}'
    | 'MatchLoop' ~ ('Infinite' | integer) ~ '{' ~ pattern_statements 'BreakPoint' ~ '{' ~ pattern_statements '}' ~ '}'
    | ('GoTo' | 'ScanChain') ~ identifier ';'
    | 'BreakPoint' '{' ~ pattern_statements '}'
    | ('BreakPoint' | 'IddqTestPoint' | 'Stop') ~ ';'
    | 'TimeUnit' ~ "'" ~ number [siunit] "'" ~ ';'
    ;
vec_data_block
        =
        | signal_reference_expr '=' ~ vec_data_string ';'
        signal_reference_expr '{' ~ {vec_data_strings} '}'
    vec_data_strings
        =
        {vec_data_string ';'}+
        ;
    vec_data_string
        =
        {wfc_data}+
        | {hex_data}+
        | {dec_data}+
        ;
    wfc_data
        =
        ['\\r' ~ integer] wfcs
        | hex_mode
        | dec_mode
        ;
    hex_data
        =
        ['\\r' ~ integer] hex
        | wfc_mode
        | dec_mode
        ;
    dec_data
        =
        ['\\r' ~ integer] integer
        | wfc_mode
        | hex_mode
        ;
    hex_mode
        =
        '\\h' ~ [wfcs] {hex_data}+
        ;
    wfc_mode
        =
        '\\w' ~ {wfc_data}+
        ;
    dec_mode
        =
        '\\d' ~ [wfcs] {dec_data}+
        ;
    wfcs
        =
        /[a-zA-Z0-9#%]+/
        ;
    hex
        =
        /[0-9a-fA-F]+/
        ;

    integer::int
        =
        /\d+/
        ;

My test file has lot of sequences like this:

 "ALLPIs" = 001 \r27 Z 10001ZZ0 \r22 Z 0 \r22 Z 0 \r22 Z 0 \r20 Z 1111 \r133 Z 0Z0010; 
    "ALLPOs" = \r243 X ; 
    "ALLCIOs" = \r557 Z 0 \r10 Z 0ZZ0001001 \r19 Z ;

Upvotes: 1

Views: 250

Answers (2)

Apalala
Apalala

Reputation: 9244

You can considerably reduce the number of calls by factoring some common subexpressions.

For example:

    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'

can be written as:

    | ('Macro' | 'Call') identifier (';' | '{' ~ {vec_data_block} '}')

Using TatSu's support for reserved words should also help.

Upvotes: 2

user4979733
user4979733

Reputation: 3411

I was able to get the runtime down significantly (about 50%) by doing two things:

  1. Instead of trying to parse to create an AST for the long strings. I just read them all in one go using a regex expression. I will post-process the string later.
  2. I noticed that the bulk of time is being consumed by {vec_data_block}. After every match it tries to 4 more matches and fails before proceeding to match the '}' token. To get around this problem, I added a lookahead rule (&'}') to pass if it sees the '}' token. This kind of acts like the 'cut' expression and prevents the 4 failing matches + recursion from happening. This is the bulk of the speedup. Not sure if this is the correct way to do it, but it works.

Here is the updated grammar with the two changes:

pattern_stmt
    =
    (('V'|'C')&'{' | 'Vector' | 'Condition') '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'
    | ('W' | 'WaveformTable') ~ identifier ';'
    | annotation
    | 'Loop' ~ integer '{' ~ [pattern_statements] '}'
    | 'MatchLoop' ~ ('Infinite' | integer) ~ '{' ~ pattern_statements 'BreakPoint' ~ '{' ~ pattern_statements '}' ~ '}'
    | ('GoTo' | 'ScanChain') ~ identifier ';'
    | 'BreakPoint' '{' ~ pattern_statements '}'
    | ('BreakPoint' | 'IddqTestPoint' | 'Stop') ~ ';'
    | 'TimeUnit' ~ "'" ~ number [siunit] "'" ~ ';'
    ;
vec_data_block
    =
    | &'}'  <-- Short circuit the next 4 matches if lookahead and find a '}'
    | signal_reference_expr '=' ~ vec_data_string ';'
    | signal_reference_expr '{' ~ {vec_data_strings} '}'
    | annotation
    | non_cyclized_data
    ;
non_cyclized_data
    =
    '@' integer event_pair ';'
     | '@' integer '{' ~ ';'.{event_pair} '}'
    ;
event_pair
    =
    signal_reference_expr '=' ~ event
    | annotation
    ;
vec_data_strings
    =
    {vec_data_string ';'}+
    | annotation
    ;
foo
    =
    /[0-9A-Za-z#%\\r\\h\\d\\w]+/
    ;
vec_data_string
    =
    {foo}+   <-- Just read the tokens into one big string for post-processing later.
    ;

Upvotes: 2

Related Questions