Reputation: 378
I have a huge text file filled with HTML elements. I only want the contents of each tag's value attribute. Ex:
<option value="API" datatype="string" datatype_value="0">API</option>
<option value="Account" datatype="string" datatype_value="0">Account</option>
<option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>
I only want "API" after 'option value'.
Right now I have this:
awk -F "option value=" '{print $2}' /inputFilePath | awk '{print $1}'
It works, but ONLY on the first line of the file. So when I run the command above on the file, the output is only:
"API"
And not "Account", "Address" or anything after.
Any thoughts on anything I could be doing wrong? Thanks in advance!
Upvotes: 0
Views: 1118
Reputation: 440412
The symptom suggests that perhaps your <option> elements are all on a single line rather than each element on its own line.
Update: The OP hasn't yet provided feedback about what the original problem turned out to be, but given that the accepted answer works regardless of whether a single line contains multiple elements or just one, the above guess is likely correct.
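To make the guess above concrete, here is a minimal sketch that runs the question's pipeline on two hypothetical inputs (shortened sample elements, not the OP's actual file): one with both elements on a single line and one with an element per line. With a single line, awk sees one record, so only the text after the first separator ends up in $2.

```shell
# Hypothetical sample data, shortened from the elements in the question.
a='<option value="API" datatype="string">API</option>'
b='<option value="Account" datatype="string">Account</option>'

# Case 1: both elements on ONE line, so awk sees a single record.
# $2 is only the text between the first and second separator.
single=$(printf '%s\n' "$a$b" |
  awk -F 'option value=' '{print $2}' | awk '{print $1}')

# Case 2: one element per line, so awk sees one record per element.
multi=$(printf '%s\n' "$a" "$b" |
  awk -F 'option value=' '{print $2}' | awk '{print $1}')

printf 'single line:\n%s\n' "$single"
printf 'one per line:\n%s\n' "$multi"
```

In the first case only "API" is printed, which matches the symptom described in the question; in the second case both values appear.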
(This answer originally contained a suboptimal sed solution that the first two comments reference - I've removed it.)
If you can use GNU awk or mawk, the accepted answer is a great solution for the specific problem at hand.
Generally, however, using a dedicated HTML/XML-parsing CLI is preferable - it truly understands the structure of the data and provides a more robust and flexible way to extract data.
For instance, with the multi-platform web-scraping CLI xidel the solution would simplify to:
xidel -q -e '//option/@value' file
//option/@value is an XPath query that selects the value attribute of all option elements across all levels of the DOM (make more specific as needed).
Upvotes: 0
Reputation: 674
Add the condition $1 ~ /API/ to your example code:
awk -F "option value=" '{print $2}' /inputFilePath | awk '$1 ~ /API/ {print $1}'
Upvotes: -1
Reputation: 41460
This should work with any awk:
awk -F"<option value=" '{split($2,a,"\"");print a[2]}' file
API
Account
Address - asn
If you need the double quote:
awk -F"<option value=" '{split($2,a,"\"");print "\""a[2]"\""}' file
"API"
"Account"
"Address - asn"
Upvotes: 0
Reputation: 75588
Modify RS instead:
awk 'BEGIN { RS = "<option value=\"" ; FS = "\""; } NF { print $1 }' file
Output:
API
Account
Address - asn
I just hope it works with your awk: it relies on a multi-character RS, which nawk doesn't support.
Another option, using GNU awk:
gawk '{ t = $0; while (match(t, /<option value="([^"]*)"(.*)/, a)) { print a[1]; t = a[2] } }' file
I explicitly used [^"]* since I consider empty values still valid for your query, but you can change that to [^"]+ if preferred.
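If neither a multi-character RS nor gawk's three-argument match() is available, the same loop can be sketched with the two-argument match() and the RSTART/RLENGTH variables that POSIX awk defines. This is only an illustration, not part of the original answer; the function name extract_values is hypothetical, and the offsets assume the literal 15-character prefix <option value=" as in the question's data.

```shell
# extract_values: hypothetical helper; reads the named files, or stdin if none given.
extract_values() {
  awk '{
    rest = $0
    # Two-argument match() sets RSTART and RLENGTH for the leftmost match.
    while (match(rest, /<option value="[^"]*"/)) {
      # Skip the 15-character prefix and drop the closing quote.
      print substr(rest, RSTART + 15, RLENGTH - 16)
      # Continue scanning after the current match.
      rest = substr(rest, RSTART + RLENGTH)
    }
  }' "$@"
}

# Usage: several elements on one line are all reported.
printf '%s\n' '<option value="API">API</option><option value="Address - asn">Address - asn</option>' |
  extract_values
```

Like the gawk version, this keeps empty values (printed as empty lines) and handles values that contain spaces.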
Upvotes: 2