StevieD

Reputation: 7443

Using a grammar to parse a string without lookahead?

Got this text:

Want this || Not this

The line may also look like this:

Want this | Not this

with a single pipe.

I'm using this grammar to parse it:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? <?before <divider>> }
       token divider { <[|]> ** 1..2 } 
       token post { \N* }
    } 

Is there a better way to do this? I'd love to be able to do something more like this:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

But this does not work. And if I do this:

    grammar HC {
       token TOP {  <pre>* <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

Each character before divider gets its own <pre> capture. Thanks.

Upvotes: 4

Views: 163

Answers (2)

librasteve

Reputation: 7581

OK, I tried use Grammar::Tracer; (our best friend!) and got the trace below from both your original grammar and the regex version in the first answer. Both seemed wrong to me:

TOP
|  pre
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * MATCH "|"
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
 pre => 「Want this 」
 divider => 「|」
 post => 「 Not this」

This gives me the feeling that your combination of pre and divider is not converging: pre has to test for divider at every character until it finally succeeds. So I altered the code to this (with a more definite definition of pre):

    use Grammar::Tracer;

    grammar HC {
        token TOP { <pre> <divider> <post> }
        token pre { <-[|]>* }
        token divider { <[|]> ** 1..2 }
        token post { \N* }
    }

and got this...

TOP
|  pre
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
 pre => 「Want this 」
 divider => 「|」
 post => 「 Not this」

So I conclude that (i) using Grammar::Tracer to inspect how a grammar operates is a must, (ii) a loose definition like the original pre, which forces the parser to test for the divider at every character boundary, should be avoided, and (iii) this matters all the more when the divider is hard to pin down.

I have the wider feeling that a Grammar (parser) may not be well suited to the underlying raw data structure and that a set of regexes may be a better approach.

I failed to work out how to use <.ws> or equivalent to trim the empty spaces from the captured results.
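One way around the trailing space, without touching the grammar, is to trim the capture after parsing. This is only a sketch and not a <.ws> solution: it reuses the grammar above, and the wanted helper is a hypothetical name introduced here for illustration.

```raku
grammar HC {
    token TOP     { <pre> <divider> <post> }
    token pre     { <-[|]>* }          # everything up to the first pipe
    token divider { <[|]> ** 1..2 }    # one or two pipes
    token post    { \N* }              # rest of the line
}

# Hypothetical helper (not from the original post): parse a line and
# return the text before the divider with surrounding whitespace removed,
# or Nil if the line does not parse.
sub wanted(Str $line) {
    my $m = HC.parse($line);
    return Nil unless $m;
    $m<pre>.Str.trim;
}

say wanted('Want this || Not this');   # Want this
say wanted('Want this | Not this');    # Want this
```

The same .trim call could presumably live in an actions class instead, if you prefer to keep post-processing next to the grammar.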

Upvotes: 3

raiph

Reputation: 32414

As always, TIMTOWTDI.

I'd love to be able to do something more like this

You can. Just switch the first two rule declarations from token to regex:

    grammar HC {
        regex TOP { <pre> <divider> <post> }
        regex pre { \N*? }
        token divider { <[|]> ** 1..2 }
        token post { \N* }
    }

This works because regex disables :ratchet (unlike token and rule which enable it).

(Explaining why you need to switch it off for both rules is beyond my paygrade, certainly for tonight, and quite possibly till someone else explains why to me so I can pretend I knew all along.)
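To see what :ratchet changes in isolation, here is a minimal sketch using plain regexes rather than the grammar above. It only shows the general effect of the adverb and does not attempt to explain why both rules need the switch.

```raku
# Backtracking is on by default for / / and for `regex`:
# \w+ first takes "abc", then gives back "c" so that `.` can match.
say so 'abc' ~~ / \w+ . /;       # True

# With :r (short for :ratchet, which `token` and `rule` switch on),
# \w+ keeps everything it took, nothing is left for `.`, and the match fails.
say so 'abc' ~~ / :r \w+ . /;    # False
```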

if I do this ... each character gets its own <pre> capture

By default, "calling a named regex installs a named capture with the same name" [... couple sentences later:] "If no capture is desired, a leading dot or ampersand will suppress it". So change <pre> to <.pre>.

Next, you can manually add a named capture by wrapping a pattern in $<name>=[pattern]. So to capture the whole string matched by consecutive calls of the pre rule, wrap the non-capturing pattern (<.pre>*?) in $<pre>=[...]:

    grammar HC {
        token TOP { $<pre>=[<.pre>*?] <divider> <post> }
        token pre { \N }
        token divider { <[|]> ** 1..2 }
        token post { \N* }
    }
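The two capture rules quoted above can also be checked on their own. The sketch below is separate from the HC grammar: it uses the built-in <alpha> and <digit> rules, and the input string and the num capture name are made up for illustration.

```raku
# <alpha>+ captures once per call, so three letters give three captures.
# <.digit>+ matches digits without capturing each call, and
# $<num>=[...] captures the whole run as a single named capture.
my $m = 'abc123' ~~ / <alpha>+ $<num>=[<.digit>+] /;

say $m<alpha>.elems;     # 3
say ~$m<num>;            # 123
say $m<digit>.defined;   # False
```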

Upvotes: 7
