jan
jan

Reputation: 259

Repeating regex pattern

I have a string such as this

word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>

where, if there is one ore more words enclosed in tags. In those instances where there are more than one words (which are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:

word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>

So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:

perl -pe 's///g;' in out

This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :

/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/

It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?

EDIT: This is part of a workflow that does some other transformations within a shell script:

#!/bin/sh

perl -pe '# 
  s/replace/me/g;  
  s/replace/me/g;  
  ' $1 > tmp

... some other commands ...

Upvotes: 2

Views: 397

Answers (2)

zdim
zdim

Reputation: 66964

This needs a mini nested-parser and I'd recommend a script, as easier to maintain

use warnings;
use strict;
use feature 'say';

my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);

my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x;   # / (stop markup highlighter)

$str =~ s{$tag_re}{
    my ($o, $t, $c) = ($1, $2, $3);  # open (tag), text, close (tag)
    $t =~ s/(\w+)/$o$1$c/g; 
    $t;
}ge;

say $str;

The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.

One way to provide input for a script is via command-line arguments, available in @ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded

my $str = q(...);

with

my $str = shift @ARGV;  # first argument on the command line

and then use that script in your shell script as

#!/bin/sh
...
script.pl $1 > output_file

where $1 is the shell variable as shown in the "Edit" to the question.


In a one-liner

echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>"  |
    perl -wpe'
        s{(<[^>]+>) (.+?) (</[^>]+>)}
         {($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
    '

what in your shell script becomes   echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV and drop the -n switch, and add a print

#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file 

where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.

The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).

Upvotes: 2

hepcat72
hepcat72

Reputation: 1136

This will do it:

s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;

The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".

The s/ at the beginning is a substitution, so the command would be something like:

cat in | perl -pne 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;$_' > out

You need the $_ at the end because the result of the global substitution is the number of substitutions made.

This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:

my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
    my $p = $1;
    my $s = $2;
    print($p);
    $s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
    print($s);
    $e = $';   # ' (stop markup highlighter)
}
print($e);

You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.

Upvotes: -1

Related Questions