Reputation: 1726
I'm looking for instructions on using awk to extract text blocks that contain specific rows from a file.
The file has the following structure:
<Information>
<CID>_whole_number_A_</CID>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_PATTERN_A_</string>
<string>_text_that_is_not_useful_</string>
</Information>
<Information>
<CID>_whole_number_B_</CID>
<string>_PATTERN_B_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
<string>_text_that_is_not_useful_</string>
</Information>
I would like awk to send the following output to a new file:
<Information>
<CID>_whole_number_A_</CID>
<string>_PATTERN_A_</string>
</Information>
<Information>
<CID>_whole_number_B_</CID>
<string>_PATTERN_B_</string>
</Information>
Notes about the data:
Notes about my environment:
So, rephrasing in English:
in FILE_1
find every CID that has a UNII
send the filtered results to FILE_2
Thanks in advance for instructions.
========================================================================
OK, I'm doing something wrong.
In my first implementation, the program only returns "record starts" and "closing tag," i.e.:
<Information>
</Information>
Here is how I applied your instructions.
First, I'm running Windows, so I changed FS to "\r\n".
The first regular expression is UNII, so I changed it to /UNII/.
The second regular expression is CID, which you used in your instructions. I made no change there.
For the second instance of PATTERN, I changed it to /UNII/.
Here is how my substitutions look:
BEGIN {
    RS="<Information>"
    FS="\r\n"
}
/UNII/ {
    print RS
    for (i=1;i<NF;i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i
        }
    }
    print "</Information>"
}
Because I am using Windows, I use a full path to execute the GnuWin32 utilities and read/write data. So my .bat file looks like this:
C:\bin\awk -f C:\bin\script.awk < C:\Users\Owner\data\input_file.txt > C:\Users\Owner\data\output_file.txt
What am I doing wrong?
=================================================================================
Here is sample data:
<Information>
<CID>1</CID>
<Synonym>Acetyl carnitine</Synonym>
<Synonym>O-Acetyl-L-carnitine</Synonym>
<Synonym>Ammonium, (3-carboxy-2-hydroxypropyl)trimethyl-, hydroxide, inner salt, acetate, DL-</Synonym>
<Synonym>UNII-07OP6H4V4A</Synonym>
<Synonym>_20+_more_</Synonym>
</Information>
<Information>
<CID>10006</CID>
<Synonym>HYDANTOIN</Synonym>
<Synonym>UNII-I6208298TA</Synonym>
<Synonym>53760_FLUKA</Synonym>
<Synonym>NSC9226</Synonym>
<Synonym>_20+_more_</Synonym>
</Information>
<Information>
<CID>10007</CID>
<Synonym>Lucofen SA</Synonym>
<Synonym>461-78-9</Synonym>
<Synonym>EINECS 207-314-9</Synonym>
<Synonym>STK664067</Synonym>
<Synonym>DEA No. 1645</Synonym>
<Synonym>UNII-NHW07912O7</Synonym>
<Synonym>CHEMBL1201269</Synonym>
<Synonym>HMS1376E21</Synonym>
<Synonym>_20+_more_</Synonym>
</Information>
Upvotes: 2
Views: 765
Reputation: 85785
This script should provide a good starting point:
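BEGIN {
    RS="<Information>"   # one record per block (multi-character RS needs gawk or mawk)
    FS="\n"              # one field per line within the block
}
/UNII/ {                 # only records that contain a UNII line
    print RS             # re-emit the opening tag that RS consumed
    for (i=1;i<NF;i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i     # keep only the CID and UNII lines
        }
    }
    print "</Information>"
}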
Saving it to script.awk and running it on your sample input produces:
$ awk -f script.awk file
<Information>
<CID>1</CID>
<Synonym>UNII-07OP6H4V4A</Synonym>
</Information>
<Information>
<CID>10006</CID>
<Synonym>UNII-I6208298TA</Synonym>
</Information>
<Information>
<CID>10007</CID>
<Synonym>UNII-NHW07912O7</Synonym>
</Information>
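If the input has Windows (CRLF) line endings, or your awk only honors a single-character RS, a line-by-line variant may be easier to get working. The following is only a sketch under those assumptions, not a drop-in replacement: the file names filter.awk, sample.txt and filtered.txt are placeholders, and the two-record sample is made up from the data in the question. It strips a trailing carriage return from each line, remembers the CID line and any UNII lines, and prints the block only when a UNII was seen.

```shell
# Hypothetical file names: filter.awk, sample.txt, filtered.txt.
cat > filter.awk <<'EOF'
{ sub(/\r$/, "") }                                # drop a trailing CR (CRLF input)
/<Information>/   { cid = ""; unii = ""; next }   # record start: reset state
/<CID>/           { cid = $0; next }              # remember the CID line
/UNII/            { unii = unii $0 "\n"; next }   # collect UNII lines
/<\/Information>/ {                               # record end: emit only if a UNII was seen
    if (unii != "")
        printf "<Information>\n%s\n%s</Information>\n", cid, unii
}
EOF

# Two-record sample with CRLF endings; only the first record has a UNII.
printf '%s\r\n' \
    '<Information>' '<CID>1</CID>' \
    '<Synonym>Acetyl carnitine</Synonym>' \
    '<Synonym>UNII-07OP6H4V4A</Synonym>' '</Information>' \
    '<Information>' '<CID>2</CID>' \
    '<Synonym>HYDANTOIN</Synonym>' '</Information>' > sample.txt

awk -f filter.awk sample.txt > filtered.txt
cat filtered.txt
```

Note that the output is written with plain LF line endings, since the carriage returns are removed before anything is printed.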
Upvotes: 1
Reputation: 8866
First, awk is completely the wrong tool for this. But the simplest way to do it with awk is to suppress the lines you don't want (rather than selecting the ones you do want):
/Synonym/ && !/UNII/ { next }
{ print }
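As a usage sketch, the two rules can also be run as a one-liner. The file names input_file.txt and output_file.txt are placeholders, and the single record is a shortened copy of the sample data. Keep in mind that this approach prints the <Information> and <CID> lines of every record, so a record with no UNII synonym would still show up in the output, just without any Synonym lines.

```shell
# Placeholder file names; one shortened record from the sample data.
printf '%s\n' \
    '<Information>' '<CID>1</CID>' \
    '<Synonym>Acetyl carnitine</Synonym>' \
    '<Synonym>UNII-07OP6H4V4A</Synonym>' '</Information>' > input_file.txt

# Skip Synonym lines that lack UNII; print everything else.
awk '/Synonym/ && !/UNII/ { next } { print }' input_file.txt > output_file.txt
cat output_file.txt
```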
Upvotes: 1