user6189164

Reputation: 667

Perl6 Parse File

As practice, I'm trying to parse some standard text that is the output of a shell command.

  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors

I want to use a Perl6 grammar and I want to capture each of the fields in a separate token or regex. So, I made the following grammar:

grammar zpool {
        regex TOP { \s+ [ <keyword> <collection> ]+ }
        token keyword { "pool: " | "state: " | "status: " | "action: " | "scan: " | "config: " | "errors: " }
        regex collection { [<:!keyword>]*  }
}

My idea is that the grammar finds a keyword, then collects all the data up to the next keyword. However, every time I run it, the first keyword ("pool: ") matches and the collection swallows all the remaining text:

 keyword => 「pool: 」
 collection => 「homePool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors
」

I don't know how to get it to stop eating the characters when it finds a keyword and then treat that as another keyword.

Upvotes: 3

Views: 233

Answers (2)

raiph

Reputation: 32404

Problem 1

You've written <:!keyword> instead of <!keyword>. That's not what you want. You need to delete the :.

The <:foo> syntax in a P6 regex matches a single character with the specified Unicode property, in this case the property :foo which in turn means :foo(True).

And <:!keyword> matches a single character with the Unicode property :keyword(False).

But there is no Unicode property :keyword.

So the assertion is always true, and it consumes a single character of input each time it matches.

So the pattern just munches its way through the rest of the text, as you've seen.
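
For contrast, here is a minimal sketch of what the <:...> syntax does with a property that does exist. Lu (uppercase letter) is a standard Unicode general category; the sample string and the variable names are made up for illustration.

```raku
# <:Lu> matches one character that has the Unicode general category Lu
# (uppercase letter); <:!Lu> matches one character that does not have it.
my $upper     = 'aB' ~~ / <:Lu>  /;
my $not-upper = 'aB' ~~ / <:!Lu> /;

say $upper.Str;      # B
say $not-upper.Str;  # a
```

Either way, one character of input is consumed per match, which is why the original collection pattern kept advancing.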

Problem 2

Once you fix problem 1, a second problem arises.

<:!keyword> matched a single character, so it automatically munched some input (one character) each time it matched.

In contrast, <!keyword> is a zero-width assertion: it does not consume any input when it matches. A quantified group that contains nothing but <!keyword> never advances, so you have to make sure the pattern that uses it also munches input.
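
A small sketch of the difference, assuming a hypothetical one-keyword token and a made-up test string (neither is from the question):

```raku
# A hypothetical one-keyword token, just for this demonstration.
my token keyword { 'state:' }

my $text = 'ONLINE state: ok';

# On its own, <!keyword> succeeds at position 0 without consuming anything.
my $zero-width = $text ~~ / ^ <!keyword> /;
say $zero-width.Str.chars;   # 0

# Followed by `.`, each iteration consumes one character, and the loop
# stops at the first position where <keyword> would match.
my $collected = $text ~~ / ^ [ <!keyword> . ]* /;
say $collected.Str;          # ONLINE (plus the trailing space)
```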


After fixing those two problems you'll get the sort of output you expected. (The next problem you'll see is that the config keyword doesn't work because the : in config: in your input file example isn't followed by a space.)


So, with a few cleanups:

my @keywords = <pool state status action scan config errors> ;

say grammar zpool {
    token TOP        { \s+ [ <keyword> <collection> ]* }
    token keyword    { @keywords ': ' }
    token collection { [ <!keyword> . ]* }
}

I've switched all the patterns to token declarations. In general, always use token unless you know you need something else. (regex enables backtracking, which can dramatically slow things down if you're not careful. rule makes whitespace in the pattern significant.)
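
To make the token/regex difference concrete, here is a toy sketch. The pattern and the names are illustrative only and have nothing to do with the zpool input.

```raku
# Same pattern body, declared two ways (names are illustrative only).
my regex backtracking { a+ ab }
my token ratcheting   { a+ ab }

# regex: a+ first grabs "aaa", then gives one "a" back so that "ab" can match.
my $with    = ?('aaab' ~~ / ^ <backtracking> $ /);
# token: a+ keeps "aaa", "ab" cannot match at "b", and the match fails.
my $without = ?('aaab' ~~ / ^ <ratcheting> $ /);

say $with;     # True
say $without;  # False
```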

I've extracted the keywords into an array. Inside a regex, @keywords is interpolated as the alternation @keywords[0] | @keywords[1] | ....

I've added a . after <!keyword> in the last pattern (to consume a character's worth of input, to avoid the infinite loop that would otherwise occur given that <!foo> does not consume any input).
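
Putting it together, here is a hedged sketch of one way to run such a grammar and pull the fields out. It is not a drop-in copy of the grammar above. Assumptions: the grammar is renamed ZpoolSketch, TOP uses \s* rather than \s+, the keyword token captures the bare name as $<name> and accepts ':' \h* so that config: at the end of a line also matches, and the input is an abbreviated version of the sample from the question.

```raku
# Abbreviated sample input (shortened from the question).
my $input = q:to/EOF/;
  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    homePool    ONLINE       0     0     0

errors: No known data errors
EOF

my @keywords = <pool state status action scan config errors>;

grammar ZpoolSketch {
    token TOP        { \s* [ <keyword> <collection> ]* }
    # Assumption: ':' \h* instead of ': ', so "config:" at end of line matches.
    token keyword    { $<name>=[ @keywords ] ':' \h* }
    # Consume one character at a time until the next keyword starts.
    token collection { [ <!keyword> . ]* }
}

my $match = ZpoolSketch.parse($input) or die 'parse failed';

# <keyword> and <collection> sit inside a quantified group,
# so each one is a list of matches, in input order.
my @names  = $match<keyword>.map({ $_<name>.Str });
my @values = $match<collection>.map({ .Str.trim });
my %fields = @names Z=> @values;

say %fields<pool>;    # thisPool
say %fields<errors>;  # No known data errors
```

Note that the keyword token is not anchored to the start of a line, so a keyword followed by a colon inside a value would split that value. The sample text happens not to contain one.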

In case you haven't seen them, note that the available grammar debugging options are your friend.

Hth

Upvotes: 6

moritz

Reputation: 12842

As much as I love a good grammar now and then, this is much easier to solve with a call to split:

my $input = q:to/EOF/;
  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors
EOF

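my @delimiter = <pool state status action scan config errors>;
my %fields;
# :v keeps the delimiter matches in the result, so the list alternates
# text, key, value, key, value, ...; [1..*] drops the text before the first key.
for $input.split( / ^^ \h* (@delimiter) ':' \h*/, :v)[1..*] -> $key, $value {
    # $key is the Match for the delimiter; $key[0] is the captured key name.
    %fields{ $key[0] } = $value.trim;
}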

say %fields.perl;

This works by splitting on known keys, discarding the first element (since we know the input starts with a key, not a value), and then iterating keys and values in lockstep.

Now, since you asked for a grammar: we can easily turn the split call into a pure regex by replacing each value with .+? (any string, but as short as possible).
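
As a rough sketch of that intermediate step (my reading of it, not code from the answer): the key pattern from the split call, followed by a frugal .+? that stops at the next key line or at the end of the string. The explicit look-ahead terminator and the abbreviated input are assumptions added to make the example self-contained.

```raku
# Abbreviated sample input (shortened from the question).
my $input = q:to/EOF/;
  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    homePool    ONLINE       0     0     0

errors: No known data errors
EOF

my @delimiter = <pool state status action scan config errors>;
my %fields;

# Key pattern as in the split call, then a frugal value (.+?) that ends
# as soon as the next key line, or the end of the string, comes up.
my @matches = $input.match(
    / ^^ \h* (@delimiter) ':' \h* (.+?) [ <?before ^^ \h* @delimiter ':' > || $ ] /,
    :g
);

for @matches -> $m {
    %fields{ $m[0].Str } = $m[1].Str.trim;
}

say %fields<state>;   # ONLINE
say %fields<errors>;  # No known data errors
```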

And now let's give it some more structure:

my @delimiter = <pool state status action scan config errors>;
grammar ZPool {
    regex key      { @delimiter             }
    regex keychunk { ^^ \h* <key> ':'       }
    regex value    { .*?                    }
    regex chunks   { <keychunk> \h* <value> }
    regex TOP      { <chunks>+              }
}

We could do the hard work of extracting the result from the nested match tree, or instead cheat with a stateful action object:

class ZPool::Actions {
    has $!last-key;
    has %.contents;
    method key($m)   { $!last-key = $m.Str                }
    method value($m) { %!contents{ $!last-key } = $m.trim }
}

And then use it:

my $actions = ZPool::Actions.new;
ZPool.parse($input, :$actions);
say $actions.contents.perl;

key and keychunk don't need to backtrack, so you can change them from regex to token.

Of course, using .*? and backtracking could be considered cheating, so you can instead use the trick that raiph mentioned: a negative look-ahead inside the value regex.
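
A hedged sketch of what that variant might look like, with every rule declared as a token. The grammar name ZPoolLookahead and the abbreviated input are assumptions for illustration, and the result is read from the match tree rather than through the stateful action object, to keep the example self-contained.

```raku
# Abbreviated sample input (shortened from the question).
my $input = q:to/EOF/;
  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    homePool    ONLINE       0     0     0

errors: No known data errors
EOF

my @delimiter = <pool state status action scan config errors>;

grammar ZPoolLookahead {
    token key      { @delimiter             }
    token keychunk { ^^ \h* <key> ':'       }
    # No backtracking: consume characters until a key line starts.
    token value    { [ <!keychunk> . ]*     }
    token chunks   { <keychunk> \h* <value> }
    token TOP      { <chunks>+              }
}

my $match = ZPoolLookahead.parse($input) or die 'parse failed';

# Walk the match tree: one key/value pair per chunk.
my %fields = $match<chunks>.map(-> $chunk {
    $chunk<keychunk><key>.Str => $chunk<value>.Str.trim
});

say %fields<pool>;    # thisPool
say %fields<errors>;  # No known data errors
```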

Upvotes: 3
