Repeating regex pattern

Question

I have a string such as this:

word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>

where one or more words are enclosed in <gl> </gl> tags. When the tags enclose more than one word (the words are usually separated by - or =, and potentially by other non-word characters), I'd like to make sure that the tags enclose each word individually, so that the resulting string would be:

word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>

So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:

perl -pe 's///g;' in > out

This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :

/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/

It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find the other words if present? And then how do I write the replacement? Any hints on how to do this, or on where I can find further information, would be much appreciated.

EDIT: This is part of a workflow that does some other transformations within a shell script:

#!/bin/sh

perl -pe '# 
  s/replace/me/g;  
  s/replace/me/g;  
  ' $1 > tmp

... some other commands ...

hepcat72 · Accepted Answer

This will do it:

s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;

The /g at the end is what repeats the match; it stands for "global". Each new match starts where the previous one ended, and matching continues until the pattern no longer matches, so we have to be careful about where each match ends. That's what the (?=...) is for. It's a lookahead (a "followed by" pattern): the word after the separator must be present, but it is not consumed, so it doesn't count as part of where the previous match left off. That way, the next match starts at that second word and can treat it as its own first word.

The s/ at the beginning is a substitution, so the command would be something like:

perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' in > out

The substitution changes $_ in place, and -p prints $_ after each line, so nothing more is needed. The return value of the global substitution is the number of substitutions made, which is simply discarded here.

This works on one line at a time. If a tagged span can run across multiple lines, you'll need some fancier code. It also assumes the XML is well formed and that no words joined by dashes or equals signs occur outside of the tags. To account for the latter, you need an extra pattern match in a loop that pulls out the text surrounded by gl tags, so that you can do the substitution on just those portions, like:

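my $e = $in;   # if there are no gl tags, the whole line is printed unchanged
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
    my $p = $1;    # text before the tagged content, including the opening <gl>
    my $s = $2;    # content between <gl> and </gl>
    $e = $';       # ' (stop markup highlighter) save the rest now, before the next regex resets it
    print($p);
    $s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
    print($s);
}
print($e);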

You'd have to write your own surrounding loop to read STDIN and put each line into $in. (You would also need to drop the -p and -n flags, since you're reading the input and printing the output manually.) The while loop above grabs everything inside the gl tags and performs the substitution on just that content. It prints everything between the previous match (or the beginning of the string) and the current match ($p), and saves everything after the current match in $e, which is printed after the loop once the last match is done.
