Awk only processes first line of input file? Extract attribute values from HTML elements

I have a huge text file filled with HTML elements. I only want the value of each element's value attribute. For example:

<option value="API" datatype="string" datatype_value="0">API</option>
<option value="Account" datatype="string" datatype_value="0">Account</option>
<option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>

I only want "API" after 'option value'.

Right now I have this:

awk -F "option value=" '{print $2}' /inputFilePath | awk '{print $1}'

It works, but ONLY on the first line of the file. So when I run the command above on the file, my output is only:

"API"

And not "Account", "Address" or anything after.

Any thoughts on anything I could be doing wrong? Thanks in advance!

Upvotes: 0

Answers (4)

mklement0

Reputation: 440412

The symptom suggests that perhaps your <option> elements are on a single line rather than each element on its own line.
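
A quick way to check this guess, as a sketch only: the sample below joins the three elements from the question onto one line (that layout is an assumption, the OP has not confirmed it) and feeds it to the original pipeline. The names sample and original_pipeline are made up for illustration. With everything on one line, awk sees a single record, so $2 is printed exactly once.

```shell
# Assumed input: the question's three elements joined on ONE line.
sample='<option value="API" datatype="string" datatype_value="0">API</option><option value="Account" datatype="string" datatype_value="0">Account</option><option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>'

# The pipeline from the question, reading stdin instead of a file.
original_pipeline() {
  awk -F "option value=" '{print $2}' | awk '{print $1}'
}

# One input record means one $2, so only the first value comes out.
printf '%s\n' "$sample" | original_pipeline
```

If this reproduces the symptom, running wc -l on the real input file should show whether it really consists of one long line.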

Update: The OP to date hasn't provided feedback about what the original problem turned out to be, but given that the accepted answer works regardless of whether a single line contains multiple elements or just one, the above guess is likely correct.

(This answer originally contained a suboptimal sed solution that the first two comments reference - I've removed it.)

If you can use GNU awk or mawk, the accepted answer is a great solution for the specific problem at hand.

Generally, however, using a dedicated HTML/XML-parsing CLI is preferable - it truly understands the structure of the data and provides a more robust and flexible way to extract data.

For instance, with the multi-platform web-scraping CLI xidel the solution would simplify to:

xidel -q -e '//option/@value' file

//option/@value is an XPath query that selects the value attribute of all option elements across all levels of the DOM (make more specific as needed).
By default, xidel only extracts the contents of matching nodes, and prints each on a separate line.
As an HTML parser, xidel parses the HTML correctly irrespective of variations in non-significant whitespace - it doesn't matter how many lines the elements of interest are spread across.

Upvotes: 0

Rhim

Reputation: 674

Add the condition $1 ~ /API/ to your example code:

awk -F "option value=" '{print $2}' /inputFilePath | awk '$1 ~ /API/ {print $1}'

Upvotes: -1

Jotne

Reputation: 41460

This should work with every awk:

awk -F"<option value=" '{split($2,a,"\"");print a[2]}' file
API
Account
Address - asn

If you need the double quote:

awk -F"<option value=" '{split($2,a,"\"");print "\""a[2]"\""}' file
"API"
"Account"
"Address - asn"

Upvotes: 0

konsolebox

Reputation: 75588

Modify RS instead:

awk 'BEGIN { RS = "<option value=\"" ; FS = "\""; } NF { print $1 }' file

Output:

API
Account
Address - asn

Note that this relies on a multi-character RS, which not every awk supports: nawk, for one, doesn't. I just hope it works with your awk.

Yet another using GNU awk:

gawk '{ t = $0; while (match(t, /<option value="([^"]*)"(.*)/, a)) { print a[1]; t = a[2] } }' file

I deliberately used [^"]* because I consider empty values valid for your query, but you can change it to [^"]+ if you prefer.
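
For an awk without a multi-character RS or gawk's three-argument match(), the same loop can be sketched with only the two-argument match() plus RSTART and RLENGTH. This is a minimal sketch, not tested against every awk; the function name extract_option_values and the inline sample are made up for illustration.

```shell
# Hypothetical helper: reads HTML on stdin, prints one value per line.
extract_option_values() {
  awk '{
    t = $0
    # match() sets RSTART and RLENGTH; loop until this line has no more elements.
    while (match(t, /<option value="[^"]*"/)) {
      # Skip the 15-character prefix <option value=" and drop the closing quote.
      print substr(t, RSTART + 15, RLENGTH - 16)
      # Continue searching after the text just matched.
      t = substr(t, RSTART + RLENGTH)
    }
  }'
}

# Sample with two elements on the first line and one on the second.
printf '%s\n' \
  '<option value="API" datatype="string">API</option><option value="Account" datatype="string">Account</option>' \
  '<option value="Address - asn" datatype="string">Address - asn</option>' |
  extract_option_values
```

Like the gawk version, it does not care whether a line holds one element or several, and it prints an empty line for an empty value.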

Upvotes: 2
