user2187297

Reputation: 35

Extract cell value from html table using bash

I am trying to create a Bash/Perl script that extracts a specific value from a dynamically generated HTML table.

Here is a sample of my page


<table border="1" bordercolor="#FFCC00" style="background-color:#FFFFCC" width="100%" cellpadding="3" cellspacing="3">

<tr align="center">

<th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th>

</tr>

<tr>
<td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td>
</tr>

<tr>
<td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td>

</tr>

</table>

My goal is to get this value from this cell.

"Deployed"

Another way to look at it...

Retrieve all data under the "Request Status" column

The value "Deployed" is dynamic and could change.

I have tried the following:

sed -e 's/>/>\n/g' abpa_cesar_status.txt | egrep -i "^\s*[A-Z]+</td>" | sed -e 's|</td>||g' | grep Deployed

But that only works because the final grep looks for the literal string "Deployed", so it breaks as soon as the status changes.

Any ideas?

Upvotes: 3

Views: 5623

Answers (5)

Édouard Lopez

Reputation: 43401

Note that your document is ill-formed (the opening <a> tags are missing). Is that expected, or a typo? Otherwise, here is a well-formed version.

Command

I like xmlstarlet: simple and straightforward XPath for short tests:

xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n 

Explanation

sel   (or select)        - Select data (mode) or query XML document(s) (XPATH, etc)
-t or --template         - start a template
-m or --match <xpath>    - match XPATH expression
-v or --value-of <xpath> - print value of XPATH expression
-n or --nl               - print new line

Output

Deployed
Deployed
# plus empty-cell

Upvotes: 2

BeniBela

Reputation: 16917

You can also use my Xidel to get everything in the 8th column:

xidel your_table.html -e '//table//tr/td[8]'

Or, if the column position can also change, look up the column number first:

xidel your_table.html -e 'column:=count(//table//th[.="Request Status"]/preceding-sibling::*)+1' -e '//table//tr/td[$column]'

Upvotes: 3

choroba

Reputation: 241858

You can try xsh, a wrapper around XML::LibXML:

open :F html abpa_cesar_status.txt ;
$status = count(//table/tr[1]/th[.="Request Status"]/preceding-sibling::th) ;
ls //td[count(preceding-sibling::td)=$status] ;

To use it, you have to make your HTML a bit more well-formed, though (I had to remove the stray </a> tags to make the script work).

Upvotes: 2

L3viathan

Reputation: 27283

Quick and dirty:

cat your_html_file | perl -pe "s/^<\/?table.*$//g;s/^<tr .*$//g;s/<tr> (<td>.*?){8}//g;s/<th.*$//g;s/<\/.*$//g" | sed '/^$/d'

However, this is not how you should do it. Use existing (Perl?) software to parse the HTML and extract your value.

Edit: Since you changed your sample (added whitespace), this doesn't work anymore. QED.
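
If installing a parser really is not an option, a slightly less brittle take on the quick-and-dirty idea is to split on the closing tags instead of matching line layouts, and to find the column by its header text. The sketch below is only that, a sketch: the function name column_by_header is made up for illustration, and it assumes a single simple table with no nested tables, no colspan/rowspan, and no "<" or ">" inside cell text.

```shell
#!/bin/sh
# column_by_header HEADER FILE   (hypothetical helper, not part of any tool)
# Prints every <td> cell that sits in the same position as the <th> whose
# text equals HEADER. Assumes one simple table without colspan/rowspan.
column_by_header() {
    awk -v want="$1" '
        { doc = doc " " $0 }                      # slurp the file into one string
        END {
            nrows = split(doc, rows, /<\/tr>/)    # one chunk per table row
            for (r = 1; r <= nrows; r++) {
                # every cell ends with </th> or </td>; the last chunk is leftover
                ncells = split(rows[r], cells, /<\/t[hd]>/)
                for (c = 1; c < ncells; c++) {
                    cell = cells[c]
                    gsub(/<[^>]*>/, "", cell)                 # strip all tags
                    gsub(/^[ \t\r]+|[ \t\r]+$/, "", cell)     # trim whitespace
                    if (!col && cell == want) {
                        col = c                   # remember the header position
                    } else if (col && c == col) {
                        print cell                # data cell in that column
                    }
                }
            }
        }
    ' "$2"
}

# Example call, using the file name from the question:
#   column_by_header "Request Status" abpa_cesar_status.txt
```

Because it never looks at line breaks, extra whitespace between tags does not matter, but it is still pattern matching rather than parsing, so the advice above stands.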

Upvotes: 0

dogbane

Reputation: 274612

You should use a parser such as xmllint to do this.

With xmllint you can extract elements based on an xpath.

For example:

$ xmllint --html --format --shell file.html <<< "cat //table/tr/td[position()=8]/text()"
/ >  -------
Deployed
 -------
Deployed
/ >

The xpath //table/tr/td[position()=8]/text(), in the command above, returns the values from the 8th table column.
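
Inside a script, the interactive shell and its "/ >" prompts can probably be avoided with the --xpath option that reasonably recent xmllint builds provide. The following is a rough sketch under that assumption; the function names are invented for illustration. The second function looks the column up by its header text instead of hard-coding position 8.

```shell
#!/bin/sh
# Hypothetical wrappers around xmllint; they assume a build that supports --xpath.
# 2>/dev/null hides the HTML parser's complaints about the stray </a> tags.

# Print the text of the 8th cell of every row in FILE ($1).
request_status() {
    xmllint --html --xpath '//table/tr/td[8]/text()' "$1" 2>/dev/null
}

# Print the column whose <th> text equals HEADER ($1) in FILE ($2).
# HEADER must not contain a double quote. Caveat: if no such header exists,
# count() is 0 and the first column is returned instead.
column_by_name() {
    xmllint --html --xpath \
        "//table/tr/td[count(//table/tr/th[.=\"$1\"]/preceding-sibling::th)+1]/text()" \
        "$2" 2>/dev/null
}

# Example call:
#   column_by_name "Request Status" file.html
```

How the matched text nodes are separated depends on the libxml2 version: some older releases print them back to back with no newline in between, so check the output on your system before piping it into anything else.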

Upvotes: 3
