awk - how to extract a pattern

Question

I am looking for instructions on using awk to extract specific rows from each text block in a file.

The file has the following structure:


_whole_number_A_
_text_that_is_not_useful_
_text_that_is_not_useful_
_PATTERN_A_
_text_that_is_not_useful_


_whole_number_B_
_PATTERN_B_
_text_that_is_not_useful_
_text_that_is_not_useful_
_text_that_is_not_useful_
_text_that_is_not_useful_
_text_that_is_not_useful_

I would like to use awk to send the following lines to a new file:


_whole_number_A_
_PATTERN_A_


_whole_number_B_
_PATTERN_B_

Notes about the data:

The file has 300,000+ CID items, each identified by a unique whole number.
The PATTERNs (_PATTERN_A_, _PATTERN_B_, etc.) have the format UNII-<10 characters>. For example: UNII-4J4Z8788N8 or UNII-12L95QD6KV.
Not every CID has a UNII.

Notes about my environment:

I am working under Windows 7 and using the GnuWin32 utilities.

So, rephrasing in English:

in FILE_1

find every CID that has a UNII

send the filtered results to FILE_2

Thanks in advance for instructions.

========================================================================

OK, I'm doing something wrong.

In my first implementation, the program only returns "record starts" and "closing tag," i.e.:

Here is how I applied your instructions.

First, I'm running Windows so changed to FS=" "

The first regular expression is UNII, so changed to /UNII/.

The second regular expression is CID, which you used in your instructions. I made no change there.

For the second instance of PATTERN, I changed to /UNII/.

Here is how my substitutions look:

BEGIN {
    RS=""
    FS="\n"
}
/UNII/ {
    print RS
    for (i=1;i<=NF;i++)
        if ($i ~ /CID/ || $i ~ /UNII/) print $i
}

Because I am using Windows, I use a full path to execute the GnuWin32 utilities and read/write data. So my .bat file looks like this:

C:\bin\awk -f C:\bin\script.awk < C:\Users\Owner\data\input_file.txt > C:\Users\Owner\data\output_file.txt

What am I doing wrong?

========================================================================

Here is sample data:


    1
    Acetyl carnitine
    O-Acetyl-L-carnitine
    Ammonium, (3-carboxy-2-hydroxypropyl)trimethyl-, hydroxide, inner salt, acetate, DL-
    UNII-07OP6H4V4A
    _20+_more_


    10006
    HYDANTOIN
    UNII-I6208298TA
    53760_FLUKA
    NSC9226
    _20+_more_


    10007
    Lucofen SA
    461-78-9
    EINECS 207-314-9
    STK664067
    DEA No. 1645
    UNII-NHW07912O7
    CHEMBL1201269
    HMS1376E21
    _20+_more_

Chris Seymour · Accepted Answer

This script should provide a good starting point:

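BEGIN {
    RS=""       # paragraph mode: blank lines separate records
    FS="\n"     # each line of a record is one field
}
/UNII/ {
    print RS    # RS is empty, so this prints a blank separator line
    # assumes the first line of each block is the whole number
    for (i=1;i<=NF;i++)
        if (i == 1 || $i ~ /UNII/) print $i
}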

Saving it to script.awk and running it on your sample input produces:

$ awk -f script.awk file

    1
    UNII-07OP6H4V4A


    10006
    UNII-I6208298TA


    10007
    UNII-NHW07912O7
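
If the input has Windows (CRLF) line endings, the "blank" lines are not truly empty, and paragraph mode (RS="") may end up treating the whole file as a single record. As a rough sketch of a variation that avoids paragraph mode, the following reads the file line by line, strips a trailing carriage return, and only accepts UNII- followed by exactly 10 uppercase letters or digits. The function name extract_unii is made up for illustration. The sketch assumes that blocks are separated by blank lines and that the first non-blank line of each block is the CID number. Whether your GnuWin32 awk passes carriage returns through at all is something I have not checked.

```shell
# extract_unii FILE  (hypothetical helper name, not part of awk or GnuWin32)
# Prints the first line of each blank-line-separated block together with
# every line of that block containing "UNII-" plus exactly 10 characters.
extract_unii() {
    awk '
    BEGIN { first = 1 }                      # the file may start without a blank line
    { sub(/\r$/, "") }                       # drop a trailing CR (CRLF input)
    /^[ \t]*$/ { first = 1; next }           # blank line: the next line starts a block
    first { id = $0; first = 0; printed = 0; next }   # remember the block number line
    # RLENGTH == 15 means "UNII-" (5 characters) plus exactly 10 more
    match($0, /UNII-[0-9A-Z]+/) && RLENGTH == 15 {
        if (!printed) {
            if (blocks++) print ""           # blank line between output blocks
            print id
            printed = 1
        }
        print $0
    }
    ' "$1"
}
```

The awk program between the quotes can also be saved to a file and run with awk -f, as in the batch file from the question. Blocks without a UNII line produce no output at all.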
