Corey Stadnyk

Reputation: 57

How can I extract data from HTML table cells using sed, awk, or grep?

I have a Bash script that uses cURL to post data to a website and saves the response to a text file. The response is HTML, and I can't figure out how to extract the information I need from it. Here is the HTML from Info.txt:

<table cellspacing="1" cellpadding="0" border="0">
<tr><td><img src="/themes/img/status/green.gif" width="12" height="12" border="0"/></td><td><font class="small"><i>October 15, 2013 @ 1:34pm (PST)</i></font></td></tr>
<tr><td><font class="small">MF:&nbsp;&nbsp;</font></td><td><font class="small">PSVBHP9001230079779201</font></td></tr>
<tr><td><font class="small">SN:&nbsp;&nbsp;</font></td><td><font class="small">1354716309166</font></td></tr>
<tr><td><font class="small">ID:&nbsp;&nbsp;</font></td><td><font class="small">800.10</font></td></tr>
</table>

I need to extract these 3 values:

PSVBHP9001230079779201
1354716309166
800.10

I have tried grep, sed, and awk without much success; I can't work out how to extract just the values I want. The closest I have come is this grep command, which still returns the surrounding markup:

$ grep -o '[^ ]*.PSV[^ ]*' Info.txt
<tr><td><font>PSVBHP9001230079779201</font></td></tr>

Upvotes: 1

Views: 8438

Answers (3)

Todd A. Jacobs

Reputation: 84343

Use the XML2 Suite

While parsing HTML is the canonically correct solution, you certainly have other options. One of those options is to convert the HTML into a flat format that can be filtered or split with the tools of your choice. PYX notation and the intuitive but undocumented format used by the xml2 tools are two ways to represent an HTML document in a line-oriented format. For this use case, I recommend the latter.

An Example of Flattened HTML

Given your posted corpus, the following will work with the html2 utility from the xml2 package:

$ html2 < /tmp/info.txt | grep -F /td/ | grep -Ev '[:@]' | cut -d= -f2
PSVBHP9001230079779201
1354716309166
800.10

This works by:

  1. transforming the HTML into a line-oriented representation,
  2. selecting table cells with a fixed-string grep,
  3. removing attribute lines (which contain an @) and label or timestamp lines (which contain a colon) with an inverted extended regular expression match, and
  4. selecting the node value with cut.
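
To make the steps above concrete, here is a minimal sketch that runs steps 2 through 4 on a few hand-written lines in the path=value shape that html2 emits. The sample lines are an approximation typed out for illustration, not captured html2 output, and the names flattened_sample and extract_cell_values are hypothetical.

```shell
#!/bin/sh
# Hand-written approximation of flattened HTML (an assumption, not real html2 output).
flattened_sample() {
cat <<'EOF'
/html/body/table/tr/td/img/@src=/themes/img/status/green.gif
/html/body/table/tr/td/font/@class=small
/html/body/table/tr/td/font/i=October 15, 2013 @ 1:34pm (PST)
/html/body/table/tr/td/font/@class=small
/html/body/table/tr/td/font=MF:
/html/body/table/tr/td/font/@class=small
/html/body/table/tr/td/font=PSVBHP9001230079779201
/html/body/table/tr/td/font=SN:
/html/body/table/tr/td/font=1354716309166
/html/body/table/tr/td/font=ID:
/html/body/table/tr/td/font=800.10
EOF
}

# Hypothetical helper: steps 2-4 of the recipe, reading flattened lines on stdin.
extract_cell_values() {
    grep -F '/td/' |      # step 2: keep lines that sit below a table cell
    grep -Ev '[:@]' |     # step 3: drop attribute lines (@) and label/timestamp lines (:)
    cut -d= -f2           # step 4: keep the second "="-separated field, the node value
}

flattened_sample | extract_cell_values
```

On real input, replace flattened_sample with html2 < /tmp/info.txt. Inspecting the flattened lines first is a quick way to see which filters your own corpus needs.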

Flattening HTML is obviously a bit of a hack, and the recipe may require additional filtering to fit your real corpus. On the other hand, it works well from the command line and doesn't require any deep knowledge of the document type definition, document object model, or XPath. It also leverages your knowledge of core utilities like sed, grep, awk, cut, and so on.

Your mileage may vary.

Upvotes: 1

Ed Morton

Reputation: 203502

$ awk -F'[<>]' '/<tr><td><font/{print $15}' file
PSVBHP9001230079779201
1354716309166
800.10
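
Since the one-liner comes without commentary, here is a small sketch of why field 15 holds the value. With -F'[<>]', every < and > acts as a separator, so each tag and each run of text between tags becomes its own field, with empty fields between adjacent tags. The row is copied from the question; the helper name show_fields is made up for illustration.

```shell
#!/bin/sh
# One data row copied from the question's Info.txt.
row='<tr><td><font class="small">SN:&nbsp;&nbsp;</font></td><td><font class="small">1354716309166</font></td></tr>'

# Hypothetical helper: print every non-empty field together with its index.
show_fields() {
    awk -F'[<>]' '{ for (i = 1; i <= NF; i++) if ($i != "") printf "%d=%s\n", i, $i }'
}

printf '%s\n' "$row" | show_fields
```

Field 7 is the label and field 15 is the value. The timestamp row is skipped because its first cell starts with <img rather than <font, so the /<tr><td><font/ pattern never matches it. The field numbers depend on the exact markup, so any change to the tags in a row shifts them.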

Upvotes: 1

Todd A. Jacobs

Reputation: 84343

Parse HTML, Don't Grep It

Sometimes you can get away with grepping HTML if:

  1. you know the input format will remain consistent, and
  2. your data is very regular.

Your corpus doesn't seem to fit these criteria, so use an HTML or XML parser instead for best results.

Use Nokogiri

Ruby's Nokogiri gem and XPath selectors make quick work of this. For example:

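require 'nokogiri'

# Parse the saved response into a document tree.
doc = Nokogiri::HTML(File.read('/tmp/info.txt'))
# Take the second cell of every row, then drop the timestamp (it contains a colon).
doc.xpath('//td[2]').map(&:content).reject { |e| e.include?(':') }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]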

This selects the second cell from each row and discards any result that contains a colon. If you aren't sure that the field you want will always be in the second cell, this alternative checks every cell instead, drops the empty ones as well, and returns the same values for your corpus:

doc.xpath('//td').map(&:content).reject { |e| e.empty? || e.include?(':') }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]

You can certainly adjust the selectors to match any changes to your corpus, or store the results in a variable so you can refine the results after the parser returns candidate fields. The sky's the limit, but this should be enough to get you started.

Upvotes: 1
