Mohamed Nuur

Reputation: 5655

perl stream file for regex token including scanned tokens

I am trying to read a file line by line in Perl and split each line into tokens, keeping the whitespace between them as tokens too.

I have:

while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
  my $word = $1;
  #...
}

But it doesn't work when a delimited token isn't separated from its neighbours by whitespace.

For example, if my line is:

$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.';

I would like to split that line into:

$output =
[
  '/15%one (1)(2)%15/',
  ' ',
  'is',
  ' ',
  'a',
  ' ',
  '/%good (1)%/',
  ' ',
  '+',
  '/%number(2)%/',
  '.'
]

What is the best way to do this?

Upvotes: 0

Views: 296

Answers (1)

ikegami

Reputation: 385996

(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR: it matches any run of characters that doesn't contain STRING. So you can use

my @tokens;
push @tokens, $1
   while $line =~ m{
      \G
      ( \s+
      | ([\/])([0-9]*)%
        (?: (?! %\3\2 ). )*
        %\3\2
      | (?: (?! [\/][0-9]*% )\S )+
      )
   }sxg;

That doesn't validate, though: the loop silently stops at the first position where no token matches. If you want to validate, you could use

my @tokens;
push @tokens, $1
   while $line =~ m{
      \G
      ( \s+
      | ([\/])([0-9]*)%
        (?: (?! %\3\2 ). )*
        %\3\2
      | (?: (?! [\/][0-9]*% )\S )+
      | \z (*COMMIT) (*FAIL)
      | (?{ die "Syntax error" })
      )
   }sxg;

The following also validates, but it's a bit more readable and makes it easy to differentiate the token types:

my @tokens;
for ($line) {
   m{\G ( \s+ ) }sxgc
      && do { push @tokens, $1; redo };

   m{\G ( ([\/])([0-9]*)%  (?: (?! %\3\2 ). )*  %\3\2 ) }sxgc
      && do { push @tokens, $1; redo };

   m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
      && do { push @tokens, $1; redo };

   m{\G \z }sxgc
      && last;

   die "Syntax error";
}
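
As a rough sketch of what differentiating the token types could look like, the same loop can push [type, text] pairs instead of bare strings. The sub name tokenize_typed and the type names space, delimited and word are made up for illustration; nothing in the question prescribes them.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [type, text] pairs.
# The type names 'space', 'delimited' and 'word' are illustrative only.
sub tokenize_typed {
   my ($line) = @_;
   my @tokens;
   for ($line) {
      # Run of whitespace.
      m{\G ( \s+ ) }sxgc
         && do { push @tokens, [ space => $1 ]; redo };

      # /N% ... %N/ block; \2 is the slash, \3 the (possibly empty) digits.
      m{\G ( ([\/])([0-9]*)%  (?: (?! %\3\2 ). )*  %\3\2 ) }sxgc
         && do { push @tokens, [ delimited => $1 ]; redo };

      # Non-space run that stops before the start of a delimited block.
      m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
         && do { push @tokens, [ word => $1 ]; redo };

      m{\G \z }sxgc
         && last;

      die "Syntax error";
   }
   return @tokens;
}

my @typed = tokenize_typed('/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.');
printf "%-9s '%s'\n", @$_ for @typed;
```

Because each branch knows which pattern matched, the caller never has to re-inspect the token text to find out what kind of token it is.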

pos($line) will tell you where in the line the error occurred, so you can include the offset in the error message.
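
For instance, the final die could report the offset. This is only a minimal sketch: it assumes the loop is wrapped in a sub (the name tokenize and the message format are invented here), and it relies on /c leaving pos untouched when a match fails.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above; dies with the failing offset.
sub tokenize {
   my ($line) = @_;
   my @tokens;
   for ($line) {
      m{\G ( \s+ ) }sxgc
         && do { push @tokens, $1; redo };

      m{\G ( ([\/])([0-9]*)%  (?: (?! %\3\2 ). )*  %\3\2 ) }sxgc
         && do { push @tokens, $1; redo };

      m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
         && do { push @tokens, $1; redo };

      m{\G \z }sxgc
         && last;

      # With /c, a failed match leaves pos() at the end of the last token.
      # pos() is undef if no token has matched yet, hence the "// 0".
      die sprintf "Syntax error at offset %d\n", pos($line) // 0;
   }
   return @tokens;
}

my @ok = tokenize('a /%b c%/ d');              # 5 tokens

# The block opened by "/5%" is never closed, so tokenizing stops there.
my $err = eval { tokenize('ok /5%never closed'); 1 } ? '' : $@;
print $err;                                    # prints "Syntax error at offset 3"
```

The offset is where the last good token ended, which is also where the unrecognized input begins.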

Upvotes: 2
