Reputation: 446

how to extract text between two delimiters when string is present

I have a large data file which looks like:

//
ID   1.1.1.258
DE   6-hydroxyhexanoate dehydrogenase.
CA   6-hydroxyhexanoate + NAD(+) = 6-oxohexanoate + NADH.
CC   -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC       NCIB 9871.
//
ID   1.1.1.259
DE   3-hydroxypimeloyl-CoA dehydrogenase.
CA   3-hydroxypimeloyl-CoA + NAD(+) = 3-oxopimeloyl-CoA + NADH.
CC   -!- Involved in the anaerobic pathway of benzoate degradation in
CC       bacteria.
//
ID   1.1.1.260
DE   Sulcatone reductase.
CA   Sulcatol + NAD(+) = sulcatone + NADH.
CC   -!- Studies on the effects of growth-stage and nutrient supply on the
CC       stereochemistry of sulcatone reduction in Clostridia pasteurianum,
CC       C.tyrobutyricum and Lactobacillus brevis suggest that there may be at
CC       least two sulcatone reductases with different stereospecificities.
//

I want to extract the sections of this file that contain the word anaerobic. I specifically want the ID line.

Is there a way to search each section between ID and // for anaerobic and print the matches to a new file? If the whole section is printed, that is fine, as I figure I can grep the ID line out afterwards.

Expected output should be either

ID   1.1.1.259

or

ID   1.1.1.259
DE   3-hydroxypimeloyl-CoA dehydrogenase.
CA   3-hydroxypimeloyl-CoA + NAD(+) = 3-oxopimeloyl-CoA + NADH.
CC   -!- Involved in the anaerobic pathway of benzoate degradation in
CC       bacteria.
//

Upvotes: 2

Answers (3)

PesaThe

Reputation: 7499

For variety, a possible GNU sed solution:

sed -nr ':a; \@(^|\n)//$@! { N; ba }; /anaerobic/p' data

-n => suppresses automatic printing of pattern space
-r => extended regular expressions
:a => definition of a label
ba => jumps to the label a
N => appends next line to the pattern space
\@(^|\n)//$@! => matches while the pattern space does not yet end with a // line

\@(^|\n)//$@! { N; ba } therefore keeps appending the next line to the pattern space until it reaches the // section delimiter. /anaerobic/p then checks whether the accumulated section contains anaerobic and, if it does, the p command prints it.
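
Since the question asks specifically for the ID line, a small variation on the same loop may be enough: the P command prints the pattern space only up to its first newline, which here is the section's ID line. This is a minimal sketch, assuming GNU sed and that every section starts with its ID line; the temporary sample file is a hypothetical stand-in for the real data.

```shell
# Hypothetical sample file, trimmed from the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
//
ID   1.1.1.258
CC   -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC       NCIB 9871.
//
ID   1.1.1.259
CC   -!- Involved in the anaerobic pathway of benzoate degradation in
CC       bacteria.
//
EOF

# Same accumulate-until-// loop as above; P (instead of p) prints only the
# first line of a matching section, i.e. its ID line.
ids=$(sed -nr ':a; \@(^|\n)//$@! { N; ba }; /anaerobic/P' "$sample")
printf '%s\n' "$ids"
rm -f "$sample"
```

Redirecting the sed output (for example with > ids.txt) would write the result to a new file, as asked in the question.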

Upvotes: 2

Дмитрий Шатов

Reputation: 129

It's simple with awk (a multi-character RS needs gawk or mawk):

awk '/anaerobic/' RS='//\n' ORS='//\n' ./file.txt
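
A multi-character RS is an extension (gawk and mawk support it) rather than POSIX awk behaviour. If portability matters, or only the ID line is wanted, a line-by-line variant might look like the sketch below. The variable names id and printed and the temporary sample file are illustrative assumptions, not part of the answer above.

```shell
# Hypothetical sample file, trimmed from the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
//
ID   1.1.1.258
CC   -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC       NCIB 9871.
//
ID   1.1.1.259
CC   -!- Involved in the anaerobic pathway of benzoate degradation in
CC       bacteria.
//
EOF

ids=$(awk '
  # Remember the ID line of the section currently being read.
  /^ID/ { id = $0; printed = 0 }
  # Print that ID once, the first time the keyword shows up in the section.
  /anaerobic/ && !printed { print id; printed = 1 }
' "$sample")
printf '%s\n' "$ids"
rm -f "$sample"
```

This relies on the ID line being the first line of each section, as in the sample data.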

Upvotes: 3

5axola

Reputation: 80

tac file | sed -n '/anaerobic/,$p' | sed -n '/^ID/ {p;q}'

tac file: print the file from end to beginning
sed -n '/anaerobic/,$p': print from the first occurrence of anaerobic to the end of the reversed input
sed -n '/^ID/ {p;q}': search for a line starting with ID, print the first occurrence only, then quit
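
Because the last sed quits at the first ID line, the pipeline reports only one section: the last matching one in the original file. If several sections may contain the keyword, a sketch in the same reverse-reading spirit could set a flag instead of quitting. It assumes tac (GNU coreutils) is available; the flag name want and the temporary sample file are hypothetical.

```shell
# Hypothetical sample file, trimmed from the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
//
ID   1.1.1.258
CC   -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC       NCIB 9871.
//
ID   1.1.1.259
CC   -!- Involved in the anaerobic pathway of benzoate degradation in
CC       bacteria.
//
EOF

# Reading backwards, the keyword is seen before the ID line of its section,
# so raise a flag on the keyword and print the next ID line that follows.
# The final tac restores the original order of the IDs.
ids=$(tac "$sample" |
  awk '/anaerobic/ { want = 1 }
       /^ID/ && want { print; want = 0 }' |
  tac)
printf '%s\n' "$ids"
rm -f "$sample"
```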

Upvotes: 2
