flounder

Reputation: 141

Stop regex at first instance (awk)

I have lines of the form

XXXXXXXXXXXXXXXwordYYYYYYYYYYYYYYYYYYYYYYYYY<R>ZZZZZ
XXXXXXXXXXXXXXX[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYY<R>ZZZZZ

I don't want to get into the syntax issues, but I want to replace any line that contains <R> with the following text:

XXXXXXXXXXXXXXX{wordYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
XXXXXXXXXXXXXXX{[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}

Getting rid of the <R> is trivial:

str = $0
sub(/<R>/, "", str)
print str

Assume that the string is created by a program that I have no control over, that the transformed representation is processed by yet another program, and that I have to somehow (by magic) transform the output of program A into suitable syntax for program B, e.g.

A ... | awk ... | B ...

Somewhere between the sub and the print, I want to surround the data with {} as indicated. XXX...XXX, YYY...YY and ZZ...ZZ are arbitrary character sequences of arbitrary length, so I want to split the string at the word "word" or at the first [, and retain those characters in the result string. Nothing I have found quite answers this question. The closing } always goes at the end of the line, so that's equally trivial to deal with.

Note: This is a simplified description of a far more complicated syntax, but describing the details of the syntax would not be productive.

Upvotes: 0

Views: 243

Answers (4)

thanasisp

Reputation: 5975

If all you want is to surround the last part of the line (starting with word or [) with {}, you could use the GNU awk string function gensub().

gensub() provides an additional feature that is not available in sub() or gsub(): the ability to specify components of a regexp in the replacement text.

awk '{ print gensub(/(word|\[).*$/, "{&}", 1, $0) }' file

Putting it together with your existing code for deleting <R>:

awk '{
    str = $0
    sub(/<R>/, "", str)
    # (word|\[) is an alternation; a bracket expression like [word|\[] would match any single one of those characters
    print gensub(/(word|\[).*$/, "{&}", 1, str)
}' file

output:

XXXXXXXXXXXXXXX{wordYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
XXXXXXXXXXXXXXX{[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}

Note: I have assumed that your sample input is two separate lines, so the regex matches up to the end of the line ($). If it is a single line, you only have to modify the end of the regex.
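If gensub() is not available (it is specific to GNU awk), the same idea can be sketched with match() and substr(), which any POSIX awk should provide. This is only an illustration: the function name wrap_tail is made up, and it assumes the marker is either the literal word or the first [, as in the question.

```shell
# Hypothetical helper: reads lines on stdin, writes transformed lines to stdout.
wrap_tail() {
    awk '{
        str = $0
        sub(/<R>/, "", str)                  # delete the first <R>, as in the question
        if (match(str, /word|\[/))           # RSTART = start of the leftmost match
            str = substr(str, 1, RSTART - 1) "{" substr(str, RSTART) "}"
        print str
    }'
}

printf '%s\n' 'XXXwordYYY<R>ZZ' 'XXX[[YYY<R>ZZ' | wrap_tail
```

Because match() reports the leftmost match in RSTART, everything before the first word or [ stays outside the braces.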

Upvotes: 0

Ed Morton

Reputation: 204015

With a sed that has a -E arg to support EREs, e.g. GNU or OSX/BSD sed:

$ sed -E 's/((word|\[\[).*)<R>(.*)/{\1\3}/' file
XXXXXXXXXXXXXXX{wordYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
XXXXXXXXXXXXXXX{[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}

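Without -E, using basic regular expressions (note that \| is a GNU extension, so this form may not work in every POSIX sed):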

$ sed 's/\(\(word\|\[\[\).*\)<R>\(.*\)/{\1\3}/' file
XXXXXXXXXXXXXXX{wordYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
XXXXXXXXXXXXXXX{[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
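
Since \| in a basic regular expression is not required by POSIX, one way to avoid it is to split the alternation into two expressions. This is a rough sketch rather than a drop-in replacement: the function name brace_with_bre is made up, and on a line where both word and [[ appear before <R>, the word expression runs first and wins.

```shell
# Hypothetical wrapper: reads stdin, uses only basic regular expressions.
brace_with_bre() {
    sed -e 's/\(word.*\)<R>\(.*\)/{\1\2}/' \
        -e 's/\(\[\[.*\)<R>\(.*\)/{\1\2}/'
}

printf '%s\n' 'XXXwordYYY<R>ZZ' 'XXX[[YYY<R>ZZ' | brace_with_bre
```

Once the first expression has removed <R>, the second one can no longer match on that line, so each line is braced at most once.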

Upvotes: 1

potong

Reputation: 58483

This might work for you (GNU sed):

sed -E 's/^(.*)(word.*)<R>(.*) \1(\[.*)<R>\3$/\1{\2\3}\n\1{\4\3}/' file

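This treats the sample as a single line holding two space-separated records. It pattern matches on that line and, if the match is successful, substitutes using groupings and back references, writing each record on its own line.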

N.B. The back references \1 and \3 are used in the LHS of the regexp.


The runs of Y's in the question are inconsistent, i.e. they have different lengths.

Upvotes: 0

anubhava

Reputation: 785601

You may use this awk with alternation regex:

awk '{sub(/word|\[\[/, "{&"); sub(/<R>/, ""); sub(/$/, "}")} 1' file
XXXXXXXXXXXXXXX{wordYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}
XXXXXXXXXXXXXXX{[[YYYYYYYYYYYYYYYYYYYYYYYYYYYYYZZZZZ}

This sed should also work for you:

sed -E 's/word|\[\[/{&/; s/<R>//; s/$/}/' file
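
The question only asks to change lines that contain <R>, so here is a guarded variant as a sketch. It relies on sub() returning the number of substitutions it made; the function name brace_marked_lines is made up for the example.

```shell
# Hypothetical helper: only lines that contained <R> and a marker get braces.
brace_marked_lines() {
    # sub() returns 1 when it replaced something, so the action runs only
    # if <R> was removed AND an opening brace was inserted before the marker.
    awk 'sub(/<R>/, "") && sub(/word|\[/, "{&") { $0 = $0 "}" } 1'
}

printf '%s\n' 'AwordB<R>C' 'A[[B<R>C' 'wordy line' | brace_marked_lines
```

The third input line has no <R>, so it is printed unchanged.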

Upvotes: 0
