Reputation: 35
I am trying to create a Bash/Perl script that extracts a specific value from a dynamic HTML table.
Here is a sample of my page:
<table border="1" bordercolor="#FFCC00" style="background-color:#FFFFCC" width="100%" cellpadding="3" cellspacing="3"> <tr align="center"> <th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th> </tr> <tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr> <tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr> </table>
My goal is to get this value from this cell.
"Deployed"
Another way to look at it...
Retrieve all data under the "Request Status" column
The value "Deployed" is dynamic and could change.
I have tried the following:
sed -e 's/>/>\n/g' abpa_cesar_status.txt | egrep -i "^\s*[A-Z]+</td>" | sed -e 's|</td>||g' | grep Deployed
But that only finds the value by grepping for the literal string "Deployed".
Any ideas?
Upvotes: 3
Views: 5623
Reputation: 43401
Note that your document is ill-formed (it lacks some opening <a> tags). Is that normal/expected, or a typo? Otherwise, here is a well-formed version.
I like xmlstarlet: simple, straightforward XPath for short tests:
xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n
sel (or select) - Select data (mode) or query XML document(s) (XPATH, etc)
-t or --template - start a template
-m or --match <xpath> - match XPATH expression
-v or --value-of <xpath> - print value of XPATH expression
-n or --nl - print new line
Deployed
Deployed
# plus empty-cell
Upvotes: 2
Reputation: 16917
You can also use my Xidel to get everything in the 8th column:
xidel your_table.html -e '//table//tr/td[8]'
Or if the column position can also change, get the column-number first:
xidel your_table.html -e 'column:=count(//table//th[.="Request Status"]/preceding-sibling::*)+1' -e '//table//tr/td[$column]'
Upvotes: 3
Reputation: 241858
You can try xsh, a wrapper around XML::LibXML:
open :F html abpa_cesar_status.txt ;
$status = count(//table/tr[1]/th[.="Request Status"]/preceding-sibling::th) ;
ls //td[count(preceding-sibling::td)=$status] ;
In order to use it, you have to make your HTML a bit more well-formed, though (I had to remove </a> to make the script work).
Upvotes: 2
Reputation: 27283
Quick and dirty:
cat your_html_file | perl -pe "s/^<\/?table.*$//g;s/^<tr .*$//g;s/<tr> (<td>.*?){8}//g;s/<th.*$//g;s/<\/.*$//g" | sed '/^$/d'
However, this is not how you should do it. Use existing (Perl?) software to parse the HTML and extract your value.
edit: Since you changed your code (added whitespace), this doesn't work anymore. QED.
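If a real parser is not available, a somewhat less whitespace-sensitive version of the same quick-and-dirty idea could look like the sketch below. It is only an illustration: column_by_header is a made-up helper name, and the code assumes lowercase tags, a single table without nesting, and no ">" inside attribute values. It looks the column up by its header text instead of hard-coding position 8, and it is still not a substitute for proper HTML parsing.

```shell
# Hypothetical helper: print every cell under the column whose header is $1.
# Reads HTML on stdin. Assumes lowercase <tr>/<th>/<td> tags and one flat table.
column_by_header() {
  awk -v want="$1" '
    { doc = doc " " $0 }                       # slurp all input into one string
    END {
      nrows = split(doc, rows, /<\/tr>/)       # one chunk per table row
      col = 0                                  # header position, unknown so far
      for (r = 1; r <= nrows; r++) {
        ncells = split(rows[r], cells, /<\/t[hd]>/)
        for (i = 1; i <= ncells; i++) {
          gsub(/<[^>]*>/, "", cells[i])        # strip tags, including a stray </a>
          gsub(/^[ \t]+|[ \t]+$/, "", cells[i])
        }
        if (col == 0) {                        # still looking for the header row
          for (i = 1; i <= ncells; i++)
            if (cells[i] == want) col = i
          continue
        }
        if (col < ncells) print cells[col]     # last chunk is trailing text, not a cell
      }
    }
  '
}
```

Usage would be along the lines of: column_by_header "Request Status" < your_html_file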
Upvotes: 0
Reputation: 274612
You should use a parser such as xmllint to do this.
With xmllint you can extract elements based on an XPath expression.
For example:
$ xmllint --html --format --shell file.html <<< "cat //table/tr/td[position()=8]/text()"
/ > -------
Deployed
-------
Deployed
/ >
The XPath //table/tr/td[position()=8]/text(), in the command above, returns the values from the 8th table column.
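If the column position might change, the same idea can be keyed on the header text rather than the number 8. The following is a minimal sketch, assuming an xmllint build that supports the --xpath option; cells_under_header is a hypothetical helper name, and the header text is assumed not to contain double quotes.

```shell
# Hypothetical helper: print the text of every cell under the named header.
# Usage: cells_under_header FILE HEADER
cells_under_header() {
  # Column number = count of <th> siblings before the matching header, plus one.
  xpath="//table//tr/td[count(//table//th[.=\"$2\"]/preceding-sibling::th)+1]/text()"
  # --html tolerates the stray </a>; parser messages go to stderr, so hide them.
  xmllint --html --xpath "$xpath" "$1" 2>/dev/null
}
```

For example: cells_under_header file.html "Request Status". Depending on the libxml2 version, the matched text nodes may be printed one per line or run together, so the output may need post-processing.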
Upvotes: 3