Chris

Reputation: 444

Extract the string between nth pattern1 and mth pattern2 - sed/awk

In the example(s) below, how might one return the string between the nth pattern1 and mth pattern2, where pattern1 and pattern2 may occur more than once each in the string?

Better Example:

zz this is string xx zz another string xx third string zz xx and a tail xx

How would you return between the second zz and the third xx?

i.e.

another string xx third string zz

Edit: For anyone looking for this, 'Capture Groups' and 'Forward/Backward Referencing' in regular expressions seem to be the terminology for what this task requires.

EDITED ABOVE

Sorry to the initial helpers. Your answers were good; my example was poorly chosen.

Feel free to remove from here down to tidy this question up. I'm just leaving it for the sake of completeness for the original answers contributed.

Poor Original Question and Example:

echo '1a 2b 3c 4d 5e 6f 7g 8h 9i 0j'

Bonus marks if you can come up with a solution without leading or trailing spaces. I know leading/trailing spaces could be removed by piping to sed again, but I'm curious whether there is a neater solution. The output I would expect (excluding the single quotes) is:

'3c 4d 5e 6f' or ' 3c 4d 5e 6f '

I tried a few variants. I believe this is the nearest to correct with sed:

echo '1 2 3 4 5 6 7 8 9 0' | sed -n 's/.*[ ]{2}.*[ ]{4}.*/\1/p'

But, it returns the error:

sed -e expression #1, char 28: invalid reference \1 on `s' command's RHS
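
For readers hitting the same error: the pattern above contains no parenthesised group, so \1 has nothing to refer to. In addition, without -E the braces in {2} are literal characters, and [ ]{2} would mean two consecutive spaces rather than the second space. A minimal sketch with capture groups, assuming the goal is the 3rd through 6th space-separated fields (middle_fields is a hypothetical name used only for illustration):

```shell
# Hypothetical helper: keep the 3rd through 6th space-separated fields.
middle_fields() {
    # ([^ ]* ){2}             skips the first two fields and their spaces
    # (([^ ]* ){3}[^ ]*)      captures the next four fields as group 2
    sed -E 's/^([^ ]* ){2}(([^ ]* ){3}[^ ]*).*/\2/'
}

echo '1a 2b 3c 4d 5e 6f 7g 8h 9i 0j' | middle_fields
# prints: 3c 4d 5e 6f
```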

Upvotes: 1

Views: 211

Answers (3)

M. Nejat Aydin

Reputation: 10133

With the plain GNU sed:

pat1='zz'
n=2
pat2='xx'
m=3

echo 'zz this is string xx zz another string xx third string zz xx and a tail xx' |
sed "s/$pat1/\n/$n; s/$pat2/\n/$m; s/[^\n]*\n//; s/\n.*//"

Outputs

 another string xx third string zz    

s/$pat1/\n/$n replaces the $n-th occurrence of $pat1 with a newline character.
s/$pat2/\n/$m replaces the $m-th occurrence of $pat2 with a newline character.
s/[^\n]*\n// deletes everything from the beginning of the string up to and including the first newline character.
s/\n.*// deletes everything from the remaining newline character (inclusive) to the end of the string.

Note: The sed command could be simplified slightly as sed -E "s/$pat1/\n/$n; s/$pat2/\n/$m; s/.*\n(.*)\n.*/\1/"
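
The simplified form can be wrapped in a small shell function. This is only a sketch: the name between and its argument order are made up for illustration, it assumes GNU sed, patterns that contain neither / nor characters special to sed, and an $n-th pat1 that occurs before the $m-th pat2. The two final substitutions are an addition that trims the surrounding spaces.

```shell
# Hypothetical wrapper: between PAT1 N PAT2 M, reads lines from stdin.
between() {
    # 1st/2nd command: mark the N-th PAT1 and the M-th PAT2 with a newline
    # 3rd command:     keep only the text between the two newlines
    # 4th/5th command: trim leading and trailing spaces (not in the original)
    sed -E "s/$1/\n/$2; s/$3/\n/$4; s/.*\n(.*)\n.*/\1/; s/^ +//; s/ +$//"
}

s='zz this is string xx zz another string xx third string zz xx and a tail xx'
printf '%s\n' "$s" | between zz 2 xx 3
# prints: another string xx third string zz
```

A line with fewer than the requested number of occurrences is not handled cleanly by this sketch and would need an extra check.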

Upvotes: 2

Ed Morton

Reputation: 204035

Using any awk in any shell on every UNIX box:

$ cat tst.awk
BEGIN {
    n = 2
    m = 3
}
{
    $0 = encode($0)
    beg = match($0,"([^<]*<){"n"}") + RLENGTH
    end = match($0,"([^>]*>){"m-1"}[^>]+") + RLENGTH
    print decode(substr($0,beg,end-beg))
}

function encode(str) {
    gsub(/@/,"@A",str); gsub(/</,"@B",str); gsub(/>/,"@C",str)
    gsub(/zz/,"<",str); gsub(/xx/,">",str)
    return str
}

function decode(str) {
    gsub(/>/,"xx",str); gsub(/</,"zz",str)
    gsub(/@C/,">",str); gsub(/@B/,"<",str); gsub(/@A/,"@",str)
    return str
}


$ awk -f tst.awk file
 another string xx third string zz

The encode() and decode() functions turn the strings you're interested in into single characters that cannot exist anywhere else in the input, so that you can then negate them in a bracket expression, as the match() calls do.
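
To see what the encoding step does on its own, here is a small sketch that runs the same gsub() calls as encode() on a line that already contains @, < and >. The function name encode_demo and the sample input are made up for illustration.

```shell
# Hypothetical demo of the encoding step only.
encode_demo() {
    awk '{
        gsub(/@/,"@A"); gsub(/</,"@B"); gsub(/>/,"@C")   # escape literal @, < and >
        gsub(/zz/,"<"); gsub(/xx/,">")                   # now < and > can only stand for zz and xx
        print
    }'
}

echo 'a@b <zz> xx' | encode_demo
# prints: a@Ab @B<@C >
```

After this step every < in the line stands for a zz and every > for an xx, which is what makes [^<] and [^>] safe to use.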

Upvotes: 3

Sundeep

Reputation: 23677

With perl

$ s='zz this is string xx zz another string xx third string zz xx and a tail xx'

$ echo "$s" | perl -pe 's/((.*?xx){3}).*/$1/'
zz this is string xx zz another string xx third string zz xx

$ echo "$s" | perl -pe 's/((.*?xx){3}).*/$1=~s#(.*?zz){2}\s*|\s*xx$##gr/e'
another string xx third string zz

The first command, s/((.*?xx){3}).*/$1/, shows how to match up to the third occurrence of xx; .*? is a non-greedy match that consumes as little as possible.

The e flag allows Perl code in the replacement section, so you can modify the captured string to remove everything up to the second occurrence of zz, as well as the last xx, with $1=~s#(.*?zz){2}\s*|\s*xx$##gr

Upvotes: 3
