Gcap

Reputation: 378

Awk only processes first line of input file? Extract attribute values from HTML elements

I have a huge text file filled with HTML elements. I only want the value attribute of each element. For example:

<option value="API" datatype="string" datatype_value="0">API</option>
<option value="Account" datatype="string" datatype_value="0">Account</option>
<option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>

I only want "API" after 'option value'.

Right now I have this:

awk -F "option value=" '{print $2}' /inputFilePath | awk '{print $1}'

It works, but ONLY on the first line of the file. So when I run the command above on the file, the output is only:

"API"

And not "Account", "Address" or anything after.

Any thoughts on anything I could be doing wrong? Thanks in advance!

Upvotes: 0

Views: 1118

Answers (4)

mklement0

Reputation: 440412

The symptom suggests that perhaps your <option> elements are on a single line rather than each element on its own line.

Update: The OP hasn't yet said what the original problem turned out to be, but given that the accepted answer works regardless of whether a single line contains multiple elements or just one, the above guess is likely correct.
(This answer originally contained a suboptimal sed solution that the first two comments reference - I've removed it.)
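To make the suspected cause concrete, here is a minimal sketch that runs the command from the question against the same two elements, once on a single line and once on separate lines. The sample data and the helper name first_field are made up for illustration; they are not taken from the OP's actual file.

```shell
# Hypothetical sample data: two <option> elements from the question.
line1='<option value="API" datatype="string" datatype_value="0">API</option>'
line2='<option value="Account" datatype="string" datatype_value="0">Account</option>'

# The pipeline from the question, reading stdin instead of a file.
first_field() {
  awk -F "option value=" '{print $2}' | awk '{print $1}'
}

# Both elements on ONE line: awk sees a single record, so only
# the first value is printed.
one_line=$(printf '%s%s\n' "$line1" "$line2" | first_field)

# One element per line: awk sees two records and prints two values.
per_line=$(printf '%s\n%s\n' "$line1" "$line2" | first_field)

printf 'single line:\n%s\n' "$one_line"
printf 'separate lines:\n%s\n' "$per_line"
```

With the single-line input only "API" comes out, which matches the symptom described in the question.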

If you can use GNU awk or mawk, the accepted answer is a great solution for the specific problem at hand.

Generally, however, using a dedicated HTML/XML-parsing CLI is preferable - it truly understands the structure of the data and provides a more robust and flexible way to extract data.

For instance, with the multi-platform web-scraping CLI xidel the solution would simplify to:

xidel -q -e '//option/@value' file
  • //option/@value is an XPath query that selects the value attribute of all option elements across all levels of the DOM (make more specific as needed).
  • By default, xidel only extracts the contents of matching nodes, and prints each on a separate line.
  • As an HTML parser, xidel parses the HTML correctly irrespective of variations in non-significant whitespace - it doesn't matter how many lines the elements of interest are spread across.

Upvotes: 0

Rhim

Reputation: 674

Add the pattern $1 ~ /API/ to the second awk command in your example:

awk -F "option value=" '{print $2}' /inputFilePath | awk '$1 ~ /API/ {print $1}'

Upvotes: -1

Jotne

Reputation: 41460

This should work with any awk:

awk -F"<option value=" '{split($2,a,"\"");print a[2]}' file
API
Account
Address - asn

If you need the double quotes:

awk -F"<option value=" '{split($2,a,"\"");print "\""a[2]"\""}' file
"API"
"Account"
"Address - asn"

Upvotes: 0

konsolebox

Reputation: 75588

Modify RS instead:

awk 'BEGIN { RS = "<option value=\"" ; FS = "\""; } NF { print $1 }' file

Output:

API
Account
Address - asn

This relies on a multi-character RS, so I just hope it works with your awk; nawk doesn't support it.

Yet another approach, using GNU awk:

gawk '{ t = $0; while (match(t, /<option value="([^"]*)"(.*)/, a)) { print a[1]; t = a[2] } }' file

I deliberately used [^"]* because I consider empty values still valid for your query, but you can change that to [^"]+ if you prefer.
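If neither a multi-character RS nor gawk's three-argument match() is available, the same loop can probably be written with the two-argument match() and the RSTART/RLENGTH variables that POSIX awk provides. This is only a sketch under that assumption; the function name extract_values is hypothetical, and the offsets depend on the literal prefix <option value=" being exactly 15 characters long.

```shell
# Print every value="..." of <option> elements, several per line if needed.
# Uses only two-argument match(), RSTART, RLENGTH and substr().
extract_values() {
  awk '{
    t = $0
    # Each match covers: 15-character prefix + value + closing quote.
    while (match(t, /<option value="[^"]*"/)) {
      # Skip the prefix and drop the closing quote to get the value.
      print substr(t, RSTART + 15, RLENGTH - 16)
      # Continue searching in the rest of the line.
      t = substr(t, RSTART + RLENGTH)
    }
  }' "$@"
}

# Usage: extract_values file
# or:    some_command | extract_values
```

As with the gawk version, [^"]* keeps empty values (they print as empty lines); [^"]+ would skip them.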

Upvotes: 2
