Reputation: 57
I have a cURL Bash script that goes to a website, posts data, and writes the response to a text file. The response comes back as HTML and I can't figure out how to extract the information I need from it. Here is the HTML from Info.txt:
<table cellspacing="1" cellpadding="0" border="0">
<tr><td><img src="/themes/img/status/green.gif" width="12" height="12" border="0"/></td><td><font class="small"><i>October 15, 2013 @ 1:34pm (PST)</i></font></td></tr>
<tr><td><font class="small">MF: </font></td><td><font class="small">PSVBHP9001230079779201</font></td></tr>
<tr><td><font class="small">SN: </font></td><td><font class="small">1354716309166</font></td></tr>
<tr><td><font class="small">ID: </font></td><td><font class="small">800.10</font></td></tr>
</table>
I need to extract these 3 values (the MF, SN, and ID): PSVBHP9001230079779201, 1354716309166, and 800.10.
I have tried using grep, but haven't had much success; I can't seem to figure out how to extract just the values I want. I have tried multiple sed and awk commands as well, but the closest I have come is this grep command:
$ grep -o '[^ ]*.PSV[^ ]*' Info.txt
<tr><td><font>PSVBHP9001230079779201</font></td></tr>
Upvotes: 1
Views: 8438
Reputation: 84343
While parsing HTML is the canonically correct solution, you certainly have other options. One of those options is to convert the HTML into a flat format that can be filtered or split with the tools of your choice. PYX notation and the intuitive but undocumented format used by the xml2 tools are two ways to represent an HTML document in a line-oriented format. For this use case, I recommend the latter.
Given your posted corpus, the following will work with the html2 utility from the xml2 package:
$ html2 < /tmp/info.txt | fgrep /td/ | egrep -v '[:@]' | cut -d= -f2
PSVBHP9001230079779201
1354716309166
800.10
This works by:
- flattening the HTML with html2 into one line per node or attribute, in path=value form,
- keeping only the lines whose path contains /td/ (the table cells),
- discarding attribute lines (which contain @) along with the label and timestamp lines (which contain a colon), and
- cutting out the value after the equals sign.
Flattening HTML is obviously a bit of a hack, and the recipe may require additional filtering to fit your real corpus. On the other hand, it works well from the command line and doesn't require any deep knowledge of the document type definition, document object model, or XPath. It also leverages your knowledge of core utilities like sed, grep, awk, cut, and so on.
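If you'd rather keep it to a single awk pass over the html2 output, a rough equivalent of the pipeline above (a sketch, assuming the same path=value lines) looks like:
$ html2 < /tmp/info.txt | awk -F= 'index($0, "/td/") && $0 !~ /[:@]/ { print $2 }'
which should emit the same three values as the pipeline above.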
Your mileage may vary.
Upvotes: 1
Reputation: 203502
$ awk -F'[<>]' '/<tr><td><font/{print $15}' file
PSVBHP9001230079779201
1354716309166
800.10
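If you're wondering where the 15 comes from, a quick diagnostic (my own sketch, not part of the one-liner above) prints every <>-delimited field of the first matching row with its index:
$ awk -F'[<>]' '/<tr><td><font/{ for (i=1; i<=NF; i++) printf "%d=[%s]\n", i, $i; exit }' Info.txt
Every tag name, attribute list, and stretch of text between angle brackets gets its own field, and the value you want happens to land in field 15 on each data row.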
Upvotes: 1
Reputation: 84343
Sometimes you can get away with grepping HTML, but only when the markup is simple, perfectly regular, and guaranteed not to change. Your corpus doesn't seem to fit those criteria, so use an HTML or XML parser instead for best results.
Ruby's Nokogiri gem and XPath selectors make quick work of this. For example:
require 'nokogiri'
doc = Nokogiri::HTML(File.read '/tmp/info.txt');
doc.xpath('//td[2]').map(&:content).reject { |e| e.include? ':' }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]
This will select the second cell from each row and discard any results containing a colon. If you aren't sure that the field you want will always be in the second cell, this alternative also works on your corpus:
doc.xpath('//td').map(&:content).reject { |e| e.empty? or e.include? ':' }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]
You can certainly adjust the selectors to match any changes to your corpus, or store the results in a variable so you can refine the results after the parser returns candidate fields. The sky's the limit, but this should be enough to get you started.
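Since the rest of your extraction lives in a Bash script, the same XPath approach also works as a one-liner (a sketch that assumes the nokogiri gem is installed and the HTML is still at /tmp/info.txt):
$ ruby -rnokogiri -e 'puts Nokogiri::HTML(File.read("/tmp/info.txt")).xpath("//td").map(&:content).reject { |e| e.empty? || e.include?(":") }'
so you can capture the three values directly from your script, one per line.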
Upvotes: 1