Reputation: 29

Extract content between patterns

On SUSE Linux, I'd like to extract the complete section between a BEGIN string and an END string from a text file. I thought about using sed or awk.

Optionally, I would like to search for the next occurrence in another run.

It should become part of a bash script.
The result should be written to a file.

My challenges are:

The BEGIN string can occur several times before the END string appears
The BEGIN string sometimes has other characters before it on the same line
The END string sometimes has other characters after it on the same line

Example

something before ----BEGIN
first paragraph
Text Text Text
Text Text Text
Text Text Text
no ending pattern

something before ----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END---- some more text

no beginning pattern
Text Text Text
Text Text Text
END---- some more text

something before ----BEGIN
third paragraph
Text Text Text
Text Text Text
Text Text Text
no ending pattern

something before ----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END---- some more text

Text Text Text

I expect something like this:

----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

In another run I'd like to find the next complete section:

----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END----

In forums I could already find something like this:

tac < file.txt | sed  '/END-----/,$!d;/-----BEGIN/q' | tac

But it finds only the last occurrence and doesn't cut the characters at the beginning and the end.

Unfortunately, I'm not that experienced with sed/awk or regex. I would appreciate it if you could give me some guidance!

Cheers, erd

Upvotes: 2

Answers (5)

oguz ismail

Reputation: 50805

Buffer the lines between BEGIN and END, discarding the buffer whenever BEGIN occurs again, and print the buffer upon reaching END. Note that this assumes ----BEGIN always ends its line and END---- always starts its line, each separated by whitespace from any other text on that line.

awk '/BEGIN$/,/^END/ {
  if(/BEGIN$/) {
    buf=$NF
  }
  else if(/^END/) {
    print buf
    print $1
  }
  else {
    buf=(buf ORS $0)
  }
}' file
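
If the whitespace assumption does not hold, one possible variant (a sketch, not part of the answer above) is to locate the markers with index(), which also treats them as literal text rather than as regular expressions. The function name extract_sections and the variables beg, end, buf and inblock are made up for illustration, and the sketch assumes BEGIN and END never share a line.

```shell
# Hypothetical helper: print every complete section of the file given as $1.
extract_sections() {
  awk -v beg='----BEGIN' -v end='END----' '
    (p = index($0, beg)) {             # a BEGIN line (re)starts the buffer
      buf = substr($0, p)              # drop anything before the marker
      inblock = 1
      next
    }
    inblock && index($0, end) {        # first END line after a BEGIN closes it
      print buf ORS end ORS            # trailing text after END is dropped
      inblock = 0
      next
    }
    inblock { buf = buf ORS $0 }       # ordinary line inside a section
  ' "$1"
}
```

Called as extract_sections file.txt > result.txt, this should write every complete section to result.txt, each followed by an empty line.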

Upvotes: 1

potong

Reputation: 58558

This might work for you (GNU sed & bash):

b='----BEGIN' e='END----' n=1
sed -En '/'$b'/{:a;N;/'$e'/!ba;x;s/^/x/;/^x{'$n'}$/!{x;b};x;s/.*('$b'.*'$e').*/\1/p}' file

This gathers up the lines from a line containing ----BEGIN through the next line containing END----, and then uses a greedy match to find the last occurrence of ----BEGIN in the gathered text. The n variable selects which match is printed (in the example above, the first). To print the second instead:

b='----BEGIN' e='END----' n=2
sed -En '/'$b'/{:a;N;/'$e'/!ba;x;s/^/x/;/^x{'$n'}$/!{x;b};x;s/.*('$b'.*'$e').*/\1/p}' file
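
The greedy step can be tried in isolation. The following is only an illustrative sketch on a single line (the helper name trim_to_last_begin is made up): the leading .* consumes as much as it can, so the captured group starts at the last ----BEGIN that is still followed by END----.

```shell
# Hypothetical helper: keep only the text from the last ----BEGIN up to END----.
trim_to_last_begin() {
  sed -E 's/.*(----BEGIN.*END----).*/\1/'
}

printf '%s\n' 'a ----BEGIN b ----BEGIN c END---- d' | trim_to_last_begin
# prints: ----BEGIN c END----
```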

Upvotes: 0

karakfa

Reputation: 67557

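It looks like the BEGIN/END markers are not reliable and you depend on empty lines between records, which awk supports through paragraph mode (RS set to the empty string). Note that n here is the number of the blank-line-separated record, not of the complete section.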

$ awk -v n=2 -v RS= 'BEGIN {b="BEGIN"; e="END"; h="----"; s=".*"} 
                     NR==n {sub(s h b, h b); 
                            sub(e h s, e h); 
                            print}' file

----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

Upvotes: 1

Ed Morton

Reputation: 204488

$ cat tst.awk
BEGIN { beg="----BEGIN"; end="END----" }
sub(".*"beg,beg) { inBlock=1; buf="" }
inBlock {
    buf = buf $0 ORS
    if ( sub(end".*",end,buf) ) {
        print buf ORS
        inBlock=0
    }
}

$ awk -f tst.awk file
----BEGIN
second paragraph
Text Text Text
Text Text Text
Text Text Text
END----

----BEGIN
fourth paragraph
Text Text Text
Text Text Text
Text Text Text
END----
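
To get one section per run and write it to a file, as the question asks, the script above could be wrapped roughly like this. This is a sketch under assumptions: the function name nth_section, its argument order, and the counter cnt are made up for illustration and are not part of the script above.

```shell
# Hypothetical wrapper: nth_section N INPUT OUTPUT
# Writes only the Nth complete section of INPUT to OUTPUT.
nth_section() {
  awk -v n="$1" '
    BEGIN { beg = "----BEGIN"; end = "END----" }
    sub(".*" beg, beg) { inBlock = 1; buf = "" }    # (re)start at every BEGIN
    inBlock {
      buf = buf $0 ORS
      if ( sub(end ".*", end, buf) ) {              # section is complete
        if ( ++cnt == n ) { print buf; exit }       # stop after the Nth one
        inBlock = 0
      }
    }
  ' "$2" > "$3"
}
```

For example, nth_section 1 file first.txt followed by nth_section 2 file second.txt would cover the two runs from the question. If the input has fewer than N complete sections, the output file is left empty.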

Upvotes: 4

William Pursell

Reputation: 212584

It's not entirely clear if this will work, but making several assumptions based on the sample input, you might simply try:

awk '/BEGIN/ && /END/' RS= ORS='\n\n' input

That will filter out the records you want (again, I'm making assumptions about what you actually want based on the input sample), and then you can easily select records with a second awk. For example, to get the nth record, you can do something like:

N=2; awk '/BEGIN/ && /END/' RS= ORS='\n\n' input  | awk 'NR==n' n=$N RS=

Put that in a loop with N as the loop counter and you have everything that you (seem to) want.
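
Such a loop might look roughly like the sketch below. The function name split_sections, the output naming scheme PREFIX<N>.txt, and the counting step are assumptions made for illustration. Like the commands above, it keeps whole paragraphs, so any text before ----BEGIN and after END---- is not trimmed.

```shell
# Hypothetical helper: split_sections INPUT PREFIX
# Writes the Nth paragraph containing both markers to PREFIX<N>.txt.
split_sections() {
  input=$1 prefix=$2
  # Count the paragraphs that contain both BEGIN and END.
  total=$(awk '/BEGIN/ && /END/ { c++ } END { print c + 0 }' RS= "$input")
  N=1
  while [ "$N" -le "$total" ]; do
    awk '/BEGIN/ && /END/' RS= ORS='\n\n' "$input" |
      awk 'NR==n' n="$N" RS= > "$prefix$N.txt"
    N=$((N + 1))
  done
}
```

For example, split_sections input section would be expected to create section1.txt and section2.txt for the sample input in the question.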

Upvotes: 1
