erd

Reputation: 29

extract content between patterns

On SUSE Linux, I'd like to extract a complete section between a BEGIN string and an END string from a text file. I thought about using sed or awk.

Optionally, I would like to search for the next occurrence in another run.

My challenge is:

Example

something before ----BEGIN
first paragraph
Text Text Text
Text Text Text
Text Text Text
no ending pattern

something before ----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END---- some more text

no beginning pattern
Text Text Text
Text Text Text
END---- some more text

something before ----BEGIN
third paragraph
Text Text Text
Text Text Text
Text Text Text
no ending pattern

something before ----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END---- some more text

Text Text Text

I expect something like this:

----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

In another run I'd like to find the next complete section:

----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END----

In forums I could already find something like this:

tac < file.txt | sed  '/END-----/,$!d;/-----BEGIN/q' | tac

But it finds only the last occurrence and doesn't strip the text before ----BEGIN and after END----.

Unfortunately I'm not that experienced with sed/awk or regex. I would appreciate it if you could give me some guidance!

Cheers, erd

Upvotes: 2

Views: 226

Answers (5)

oguz ismail

Reputation: 50750

Buffer the lines between BEGIN and END, discarding the buffer whenever another BEGIN occurs, and print the buffer upon reaching END. Note that this assumes there's always a space before ----BEGIN and after END----, that ----BEGIN ends its line, and that END---- starts its line.

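awk '/BEGIN$/,/^END/ {
  if(/BEGIN$/) {
    buf=$NF              # (re)start the buffer with the ----BEGIN marker
  }
  else if(/^END/) {
    print buf            # complete section: flush the buffer
    print $1             # then print the END---- marker alone
  }
  else {
    buf=(buf ORS $0)     # accumulate lines inside the section
  }
}' file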

Upvotes: 1

potong

Reputation: 58351

This might work for you (GNU sed & bash):

b='----BEGIN' e='END----' n=1
sed -En '/'$b'/{:a;N;/'$e'/!ba;x;s/^/x/;/^x{'$n'}$/!{x;b};x;s/.*('$b'.*'$e').*/\1/p}' file

This gathers up the lines between ----BEGIN and END---- and then uses greedy matching to find the last occurrence of ----BEGIN in the resulting string. The n variable selects which complete section is printed (the first, in the example above). To get the second section instead:

b='----BEGIN' e='END----' n=2
sed -En '/'$b'/{:a;N;/'$e'/!ba;x;s/^/x/;/^x{'$n'}$/!{x;b};x;s/.*('$b'.*'$e').*/\1/p}' file

Upvotes: 0

karakfa

Reputation: 67467

It looks like the BEGIN/END markers are not reliable and you depend on empty lines between records, which awk supports through paragraph mode (an empty RS).

$ awk -v n=2 -v RS= 'BEGIN {b="BEGIN"; e="END"; h="----"; s=".*"}
                     NR==n {sub(s h b, h b);   # drop text before ----BEGIN
                            sub(e h s, e h);   # drop text after END----
                            print}' file

----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

Upvotes: 1

Ed Morton

Reputation: 203169

$ cat tst.awk
BEGIN { beg="----BEGIN"; end="END----" }
sub(".*"beg,beg) { inBlock=1; buf="" }
inBlock {
    buf = buf $0 ORS
    if ( sub(end".*",end,buf) ) {
        print buf ORS
        inBlock=0
    }
}

$ awk -f tst.awk file
----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END----
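
To get just one section per run, as the question asks, the same script could count completed blocks and stop at the n-th one. This is only a minimal sketch and not part of the original answer; the nth_block wrapper and the n and cnt variables are names made up for illustration:

```shell
# Hypothetical helper: print only the n-th complete block of a file.
# Usage: nth_block N FILE
nth_block() {
    awk -v n="$1" '
    BEGIN { beg = "----BEGIN"; end = "END----" }
    # A new BEGIN marker restarts the buffer, dropping any unterminated block.
    sub(".*" beg, beg) { inBlock = 1; buf = "" }
    inBlock {
        buf = buf $0 ORS
        # Trim everything after the END marker; true only once END is seen.
        if ( sub(end ".*", end, buf) ) {
            if ( ++cnt == n ) { print buf; exit }
            inBlock = 0
        }
    }' "$2"
}
```

Calling nth_block 1 file should then give the "second paragraph" section and nth_block 2 file the "fourth paragraph" section; an n larger than the number of complete sections prints nothing.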

Upvotes: 4

William Pursell

Reputation: 212198

It's not entirely clear if this will work, but making several assumptions based on the sample input, you might simply try:

awk '/BEGIN/ && /END/' RS= ORS='\n\n' input

That will filter out the records you want (again, I'm making assumptions about what you actually want based on the input sample), and then you can easily select records with a second awk. For example, to get the nth record, you can do something like:

N=2; awk '/BEGIN/ && /END/' RS= ORS='\n\n' input  | awk 'NR==n' n=$N RS=

Put that in a loop with N as the loop counter and you have everything that you (seem to) want.
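
Such a loop could look roughly like this. It is only a sketch under the same blank-line assumption; print_sections, total and the "== section N ==" headers are hypothetical additions, and the file is re-read on every pass, which is fine for small inputs:

```shell
# Hypothetical wrapper: print each complete section in turn.
# Assumes sections are separated by blank lines, as in the sample input.
print_sections() {
    file=$1
    # Count the paragraph-mode records that contain both markers.
    total=$(awk '/BEGIN/ && /END/ { c++ } END { print c + 0 }' RS= "$file")
    N=1
    while [ "$N" -le "$total" ]; do
        echo "== section $N =="
        # Filter the complete records, then pick the N-th of them.
        awk '/BEGIN/ && /END/' RS= ORS='\n\n' "$file" | awk 'NR==n' n="$N" RS=
        N=$((N + 1))
    done
}
```

Note that, like the commands above, this keeps the whole first and last line of each section; the text around the markers is not trimmed.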

Upvotes: 1
