regex Pattern Matching over two lines - search and replace

Question

I need help with a text document. The example below is an extract from a tab-delimited text document in which the first line of each three-line pattern always begins with a number. The document is always in this format, with the same tab layout on each of the three lines.

nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**

I want to replace the second field on the first line with the fourth field on the third line, and then replace the field after the colon on the second line with the original second field from the first line. Is this possible with regex? I am used to single-line search and replace, but not to multiline patterns.

000003  A009C001_151210_R6XO             V     C        11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19 
*FROM CLIP NAME:  5-1A 
*LOC: 01:00:42:15 WHITE   005_NST_010_E02 
000004  B008C001_151210_R55E             V     C        11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17 
*FROM CLIP NAME:  5-1B 
*LOC: 01:01:20:14 WHITE   005_NST_010_E03

The result would look like:

000003  005_NST_010_E02             V     C        11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19 
*FROM CLIP NAME:  A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE   005_NST_010_E02 
000004  005_NST_010_E03             V     C        11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17 
*FROM CLIP NAME:  B008C001_151210_R55E 
*LOC: 01:01:20:14 WHITE   005_NST_010_E03

Many thanks in advance.

e0k · Accepted Answer

A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.

Multiline `sed`

You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with `sed`. See the question "How can I use sed to replace a multi-line string?" for details. It is more complicated than with a single line, but it is possible.
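As a rough illustration of the multiline approach, the sketch below joins each three-line group into sed's pattern space with `N` and makes both replacements in a single substitution. It assumes GNU sed (for `-E`, `\t`, and `\n` inside bracket expressions), that all three lines are tab-separated, that `*LOC:` is its own first field on the third line, and that the second line has exactly two fields. The function name `swap_fields` and the shortened sample record are made up for the demonstration.

```shell
# Sketch only: assumes GNU sed and the tab layout described above.
# Capture groups in the substitution:
#   \1  leading number and its tab
#   \2  original second field of line 1
#   \3  rest of line 1, newline, first field of line 2 and its tab
#   \4  newline plus the first three fields of line 3 (with tabs)
#   \5  fourth field of line 3
swap_fields() {
    sed -E '/^[0-9]+\t/{
        N
        N
        s/^([0-9]+\t)([^\t\n]*)([^\n]*\n[^\t\n]*\t)[^\t\n]*(\n[^\t\n]*\t[^\t\n]*\t[^\t\n]*\t)([^\t\n]*)/\1\5\3\2\4\5/
    }'
}

# Hypothetical shortened record, written with explicit tabs.
printf '000003\tA009C001_151210_R6XO\tV\tC\n*FROM CLIP NAME:\t5-1A\n*LOC:\t01:00:42:15\tWHITE\t005_NST_010_E02\n' | swap_fields
```

With a real file this would be run as `swap_fields < infile.txt`. If the second line carries extra fields, or the tabs fall elsewhere, the pattern does not match and the group is printed unchanged.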

An AWK script

AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.

It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
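One way to check where the tabs actually are before running anything is sed's `l` command, which prints each line unambiguously: tabs appear as `\t` and the end of each line is marked with `$`. This is only a small sketch; the helper name `show_tabs` and the one-line sample are hypothetical.

```shell
# Print lines with tabs shown as \t and line ends shown as $.
# Long lines are folded, with a trailing backslash marking each fold.
show_tabs() {
    sed -n 'l' "$@"
}

# Hypothetical sample line containing a single tab.
printf '*FROM CLIP NAME:\t5-1A\n' | show_tabs
# prints: *FROM CLIP NAME:\t5-1A$
```

Running `show_tabs infile.txt` on the real file shows whether the separators are tabs or runs of spaces, which decides whether the tab-based field splitting applies.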

fixit.awk

#!/usr/bin/awk -f

# Fields are tab-separated; n counts lines within the current record
BEGIN { FS="\t"; n=0 }
{ n++ }
# A line starting with digits followed by a tab begins a new record
/^[[:digit:]]+\t/ { n=1 }

# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
    # At the third line, make replacements
    line1_2 = line1[2]
    line1[2] = $4
    line2[2] = line1_2

    # Print modified first two lines
    printf "%s", line1[1]
    for ( i=2; i<=line1_NF; ++i )
        printf "\t%s", line1[i]
    print ""
    printf "%s", line2[1]
    for ( i=2; i<=line2_NF; ++i )
        printf "\t%s", line2[i]
    print ""
}
1  # Print the third and any later lines unchanged

You can use it like

$ awk -f fixit.awk infile.txt

or to pipe it in

$ cat infile.txt | awk -f fixit.awk

This is not the most regular-expression-driven solution, but it should make the replacements that you want. For input with a more complex structure, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Tools like string substitution might work for simple, specific cases, but your input may have nuances, and you may have made assumptions, that don't hold in general. A parser can also be more powerful and implement grammars that express languages which can't be recognized with regular expressions.
