Reputation: 7443
I have this text:
Want this || Not this
The line may also look like this:
Want this | Not this
with a single pipe.
I'm using this grammar to parse it:
grammar HC {
token TOP { <pre> <divider> <post> }
token pre { \N*? <?before <divider>> }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
Is there a better way to do this? I'd love to be able to do something more like this:
grammar HC {
token TOP { <pre> <divider> <post> }
token pre { \N*? }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
But this does not work. And if I do this:
grammar HC {
token TOP { <pre>* <divider> <post> }
token pre { \N }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
Each character before the divider gets its own <pre> capture. Thanks.
Upvotes: 4
Views: 163
Reputation: 7581
OK, I tried use Grammar::Tracer; (our best friend!) and got this from both your original grammar and the first answer's version with regexes. Both seemed wrong to me:
TOP
| pre
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * FAIL
| | divider
| | * MATCH "|"
| * MATCH "Want this "
| divider
| * MATCH "|"
| post
| * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
pre => 「Want this 」
divider => 「|」
post => 「 Not this」
This gives me the feeling that your combination of pre and divider is not converging: pre has to probe for the divider at every character before it finally matches. So I altered the code to this (with a more definite definition of pre):
use Grammar::Tracer;

grammar HC {
token TOP { <pre> <divider> <post> }
token pre { <-[|]>* }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
and got this...
TOP
| pre
| * MATCH "Want this "
| divider
| * MATCH "|"
| post
| * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
pre => 「Want this 」
divider => 「|」
post => 「 Not this」
So I conclude that (i) using Grammar::Tracer to inspect how a grammar operates is a must; (ii) a loose definition like the original, which forces the parser to test for the divider at every character boundary, should be avoided; (iii) especially if the divider is hard to pin down.
I have the wider feeling that a grammar (parser) may not be well suited to this kind of raw data, and that a set of regexes may be a better approach.
I failed to work out how to use <.ws> or an equivalent to trim the surrounding spaces from the captured results.
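One possible way around the padding, offered as a sketch rather than the <.ws> solution I was looking for: leave the grammar as it is and trim each capture after the parse. This assumes post-processing is acceptable; the variable names $line, $m, $want and $not are only illustrative.

```raku
grammar HC {
    token TOP     { <pre> <divider> <post> }
    token pre     { <-[|]>* }          # everything up to the first pipe
    token divider { <[|]> ** 1..2 }    # one or two pipes
    token post    { \N* }              # rest of the line
}

for 'Want this || Not this', 'Want this | Not this' -> $line {
    my $m = HC.parse($line) // next;   # skip lines that do not parse
    # .Str stringifies a capture; .trim strips leading and trailing whitespace
    my $want = $m<pre>.Str.trim;
    my $not  = $m<post>.Str.trim;
    say "pre=$want, post=$not";        # pre=Want this, post=Not this
}
```

The captures inside the match object still contain the spaces; only the extracted strings are trimmed.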
Upvotes: 3
Reputation: 32414
As always, TIMTOWTDI.
I'd love to be able to do something more like this
You can. Just switch the first two rule declarations from token to regex:
grammar HC {
regex TOP { <pre> <divider> <post> }
regex pre { \N*? }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
This works because regex disables :ratchet (unlike token and rule, which enable it).
(Explaining why you need to switch it off for both rules is beyond my paygrade, certainly for tonight, and quite possibly till someone else explains why to me so I can pretend I knew all along.)
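For anyone unfamiliar with :ratchet, here is a minimal illustration of what it changes. It is deliberately separate from the grammar above and does not answer the "why both rules" question; it only shows that a ratcheting quantifier never gives back what it has matched.

```raku
# Without :ratchet, \w+ first takes "abc", then backs off one
# character so that "." has something left to match.
say so 'abc' ~~ / \w+ . /;       # True

# With :ratchet (short form :r), \w+ keeps everything it matched,
# "." finds nothing left, and the whole match fails.
say so 'abc' ~~ / :r \w+ . /;    # False
```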
if I do this ... each character gets its own <pre> capture
By default, "calling a named regex installs a named capture with the same name" [... couple sentences later:] "If no capture is desired, a leading dot or ampersand will suppress it". So change <pre> to <.pre>.
Next, you can manually add a named capture by wrapping a pattern in $<name>=[pattern]. So to capture the whole string matched by consecutive calls of the pre rule, wrap the non-capturing pattern (<.pre>*?) in $<pre>=[...]:
grammar HC {
token TOP { $<pre>=[<.pre>*?] <divider> <post> }
token pre { \N }
token divider { <[|]> ** 1..2 }
token post { \N* }
}
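A quick way to check the result on both input forms is sketched below. The loop and the names $line and $m are my own scaffolding, not part of the question.

```raku
grammar HC {
    token TOP     { $<pre>=[<.pre>*?] <divider> <post> }
    token pre     { \N }
    token divider { <[|]> ** 1..2 }
    token post    { \N* }
}

for 'Want this || Not this', 'Want this | Not this' -> $line {
    my $m = HC.parse($line) // next;   # skip lines that do not parse
    # $m<pre> is a single capture covering every character consumed by <.pre>
    say $m<pre>.Str.raku, ' ', $m<divider>.Str.raku, ' ', $m<post>.Str.raku;
}
```

Printing with .raku puts quotes around each string, which makes any leading or trailing spaces in the captures easy to spot.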
Upvotes: 7