Jay Gray

Reputation: 1726

awk - how to extract a pattern

I am looking for instructions on using awk to extract text blocks from a file, keeping only specific rows of each block.

The file has the following structure:

<Information>
<CID>_whole_number_A_</CID>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_PATTERN_A_</string>
<string>_text_that_is_not_useful_</string>
</Information>
<Information>
<CID>_whole_number_B_</CID>
<string>_PATTERN_B_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
</Information>

I would like awk to send the following output to a new file.

<Information>
<CID>_whole_number_A_</CID>
<string>_PATTERN_A_</string>
</Information>
<Information>
<CID>_whole_number_B_</CID>
<string>_PATTERN_B_</string>
</Information>

Notes about the data:

Notes about my environment:

So, rephrasing in English:

in FILE_1

find every CID that has a UNII

send the filtered results to FILE_2

Thanks in advance for instructions.

========================================================================

OK, I'm doing something wrong.

In my first implementation, the program only returns "record starts" and "closing tag," i.e.:

<Information>
</Information>

Here is how I applied your instructions.

First, I'm running Windows, so I changed FS to "\r\n".

The first regular expression is UNII, so I changed it to /UNII/.

The second regular expression is CID, which you used in your instructions. I made no change there.

For the second instance of PATTERN, I changed it to /UNII/.

Here is how my substitutions look:

BEGIN {
    RS="<Information>"
    FS="\r\n"
}
/UNII/ {
    print RS
    for (i=1;i<NF;i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i
        }
    }
    print "</Information>"
}

Because I am using Windows, I use a full path to execute the GnuWin32 utilities and read/write data. So my .bat file looks like this:

C:\bin\awk -f C:\bin\script.awk < C:\Users\Owner\data\input_file.txt > C:\Users\Owner\data\output_file.txt

What am I doing wrong?

=================================================================================

Here is sample data:

<Information>
    <CID>1</CID>
    <Synonym>Acetyl carnitine</Synonym>
    <Synonym>O-Acetyl-L-carnitine</Synonym>
    <Synonym>Ammonium, (3-carboxy-2-hydroxypropyl)trimethyl-, hydroxide, inner salt, acetate, DL-</Synonym>
    <Synonym>UNII-07OP6H4V4A</Synonym>
    <Synonym>_20+_more_</Synonym>
</Information>
<Information>
    <CID>10006</CID>
    <Synonym>HYDANTOIN</Synonym>
    <Synonym>UNII-I6208298TA</Synonym>
    <Synonym>53760_FLUKA</Synonym>
    <Synonym>NSC9226</Synonym>
    <Synonym>_20+_more_</Synonym>
</Information>
<Information>
    <CID>10007</CID>
    <Synonym>Lucofen SA</Synonym>
    <Synonym>461-78-9</Synonym>
    <Synonym>EINECS 207-314-9</Synonym>
    <Synonym>STK664067</Synonym>
    <Synonym>DEA No. 1645</Synonym>
    <Synonym>UNII-NHW07912O7</Synonym>
    <Synonym>CHEMBL1201269</Synonym>
    <Synonym>HMS1376E21</Synonym>
    <Synonym>_20+_more_</Synonym>
</Information>

Upvotes: 2

Views: 765

Answers (2)

Chris Seymour

Reputation: 85785

This script should provide a good starting point:

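BEGIN {
    RS="<Information>"    # one record per block (multi-character RS is an extension supported by gawk and mawk)
    FS="\n"               # each line of the record is one field
}
# Only process records that contain a UNII synonym
/UNII/ {
    print RS
    # On this input the last field is the empty text after the final newline, so i<NF is enough
    for (i=1;i<NF;i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i
        }
    }
    print "</Information>"
}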

Saving it to script.awk and running it on your sample input produces:

$ awk -f script.awk file
<Information>
    <CID>1</CID>
    <Synonym>UNII-07OP6H4V4A</Synonym>
</Information>
<Information>
    <CID>10006</CID>
    <Synonym>UNII-I6208298TA</Synonym>
</Information>
<Information>
    <CID>10007</CID>
    <Synonym>UNII-NHW07912O7</Synonym>
</Information>
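
For input with Windows line endings, here is a minimal sketch of the same script with a field separator that accepts both \n and \r\n. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX does not guarantee it). The file name script_crlf.awk and the inline sample are illustrative only. Note that if FS never matches anything in the record, NF is 1 and the loop body never runs, which leaves only the <Information> and </Information> lines in the output.

```shell
# Hypothetical file name; the script body is the one above with only FS changed.
cat > script_crlf.awk <<'EOF'
BEGIN {
    RS = "<Information>"
    FS = "\r?\n"          # accept LF or CRLF line endings
}
/UNII/ {
    print RS
    for (i = 1; i < NF; i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i
        }
    }
    print "</Information>"
}
EOF

# Small CRLF sample (taken from the question's data) to exercise the script.
result=$(printf '<Information>\r\n    <CID>1</CID>\r\n    <Synonym>Acetyl carnitine</Synonym>\r\n    <Synonym>UNII-07OP6H4V4A</Synonym>\r\n</Information>\r\n' | awk -f script_crlf.awk)
printf '%s\n' "$result"
```

Because the carriage return is consumed as part of the field separator, the printed lines carry no trailing \r.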

Upvotes: 1

pdw

Reputation: 8866

First, awk is completely the wrong tool for this. But the simplest way to do this with awk is to suppress the lines you don't want (rather than selecting the ones you do want):

/Synonym/ && !/UNII/ { next }
{ print }
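
Note that the two-line script above prints the <Information> and <CID> lines of every record, including records that have no UNII synonym. If only records containing a UNII should be written, one possible line-based sketch is to buffer each record and print it at the closing tag. It relies only on per-line pattern matching, so it does not depend on multi-character RS support. The file name filter.awk, the variable names block and has_unii, and the second sample record (CID 2) are made up for illustration.

```shell
# Hypothetical file name for the buffered filter.
cat > filter.awk <<'EOF'
{ sub(/\r$/, "") }                                  # drop a trailing CR, if any
/<Information>/      { block = ""; has_unii = 0 }   # new record: reset the buffer
/UNII/               { has_unii = 1 }               # remember that this record has a UNII
!/Synonym/ || /UNII/ { block = block $0 "\n" }      # keep non-Synonym lines and UNII lines
/<\/Information>/    { if (has_unii) printf "%s", block }
EOF

# Two records: the first has a UNII synonym, the second (invented) does not.
result=$(printf '%s\n' \
  '<Information>' \
  '    <CID>1</CID>' \
  '    <Synonym>Acetyl carnitine</Synonym>' \
  '    <Synonym>UNII-07OP6H4V4A</Synonym>' \
  '</Information>' \
  '<Information>' \
  '    <CID>2</CID>' \
  '    <Synonym>no match here</Synonym>' \
  '</Information>' | awk -f filter.awk)
printf '%s\n' "$result"
```

Only the first record is printed, reduced to its CID and UNII lines; the second record is dropped entirely.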

Upvotes: 1
