Reputation: 450
I need to write a shell script that reads an HTML file, sample.html,
and extracts data from one table column based on the value in another table column. For example, this is the HTML code:
<table style="BORDER-COLLAPSE: collapse"
border="0" bordercolor="#000000"
cellpadding="3" cellspacing="0" width="100%" height="200">
<tr >
<td class="fontStyleOne" width="30%">
<div align="left">
core6690.myserverdomain.com </div>
</td>
<td class="tdfontTwo" width="30%">
<div class="label-styler" align="left">
admin</div>
</td>
</tr>
<tr >
<td class="fontStyleOne" width="30%">
<div align="left">
core6691.myserverdomain.com </div>
</td>
<td class="tdfontTwo" width="30%">
<div class="label-styler" align="left">
secondary </div>
</td>
</tr>
<tr >
<td class="fontStyleOne" width="30%">
<div align="left">
core6692.myserverdomain.com </div>
</td>
<td class="tdfontTwo" width="30%">
<div class="label-styler" align="left">
primary </div>
</td>
</tr>
</table>
Let's say I want to determine the URL for "admin": the result would be core6690.myserverdomain.com. If my input is "primary", the output would be core6692.myserverdomain.com, and so on.
The HTML page has a lot more data, header tags, footer stuff, etc., but the important stuff that I am looking for is placed inside a table with the exact same structure I list in the code... except it has many more rows, not necessarily just 3 as in this example.
I have seen related answers on this site that use sed, grep, regular expressions, awk, and other tools; however, none of them is close enough to what I am looking for, and I do not have enough experience with any of these approaches to modify them to fit my needs.
Any suggestions? Thanks in advance.
Upvotes: 0
Views: 4693
Reputation: 36
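#!/bin/bash
# Print the server listed just before the label given as $1.
# Relies on word splitting: each cell's text must be a single word.
prev=
for i in $(grep '</div>' sample.html | sed 's/\s\+//' | sed 's/<.*>//'); do
    if [ "$i" = "$1" ]; then
        echo "$prev"
    fi
    prev=$i
done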
Example usage:
$ ./filter.sh primary
core6692.myserverdomain.com
P.S.: sample.html must be formatted exactly as you posted it here: each server and each name should end with the closing tag and start with whitespace or a tab.
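If the surrounding whitespace varies a bit, the same "remember the previous cell" idea can be sketched in awk. This is only a rough sketch under the same assumption as above, namely that each cell's text sits on the same line as its closing </div>. The function name lookup_host and its argument order are made up for illustration. Any other </div> lines elsewhere in the page would also be scanned, so it may need narrowing for the full file.

```shell
#!/bin/bash
# Sketch only: lookup_host is a hypothetical helper, not part of any tool.
# Usage: lookup_host LABEL FILE
# Assumes each cell's text is on the same line as its closing </div>.
lookup_host() {
    local label=$1 file=$2
    awk -v want="$label" '
        /<\/div>/ {
            gsub(/<[^>]*>/, "")           # drop any tags on the line
            gsub(/^[ \t]+|[ \t]+$/, "")   # trim leading and trailing blanks
            if ($0 == want) print prev    # label matched: print the previous cell
            prev = $0                     # remember this cell for the next line
        }
    ' "$file"
}
```

After sourcing the function, it could be called as: lookup_host primary sample.html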
Upvotes: 1
Reputation: 16917
My Xidel can do that, if you are allowed to use other tools.
With XPath:
xidel /tmp/f.html -e "//tr[td[2] = 'admin']/td[1]"
or pattern matching:
xidel /tmp/f.html -e "<tr><td>{.}</td><td>admin</td></tr>"
At least that's how it is done for the excerpt you posted; for the larger file, it depends on what else is there.
Upvotes: 4