Reputation: 53
I'm trying to extract 3 columns from a table in HTML. I need hostname, product + region and date added. So they would be columns 1, 3, 4.
<div class="table sectionedit2">
<table class="inline">
<tr class="row0">
<th class="col0 centeralign">hostname</th>
<th class="col1 centeralign">AKA (Client hostname)</th>
<th class="col2 leftalign">Product + Region</th>
<th class="col3 centeralign">date added</th>
<th class="col4 centeralign"> decom. date </th>
<th class="col5 centeralign"> builder </th>
<th class="col6 centeralign"> build cross-checker </th>
<th class="col7 leftalign"> <strong>decommissioner</strong></th>
<th class="col8 centeralign">customer managed filesystems</th>
<th class="col9 centeralign"> only company has root? </th>
</tr>
<tr class="row1">
<th class="col0 centeralign">HostName01</th>
<td class="col1 leftalign">Host01</td>
<td class="col2 leftalign">EU</td>
<td class="col3 centeralign">2007-01-01</td>
<td class="col4 leftalign"></td>
<td class="col5 centeralign">Me</td>
<td class="col6 centeralign">You</td>
<td class="col7 leftalign">Builder01</td>
<td class="col8 leftalign">xChecker01</td>
<td class="col9 centeralign">yes</td>
</tr>
<tr class="row2">
<th class="col0 centeralign">HostName02</th>
<td class="col1 leftalign">Host02</td>
<td class="col2 leftalign">U.S</td>
<td class="col3 centeralign">2008-09-29</td>
<td class="col4 leftalign"></td>
<td class="col5 leftalign">Me01</td>
<td class="col6 leftalign">You01</td>
<td class="col7 leftalign">Builder02</td>
<td class="col8 leftalign">xChecker02</td>
<td class="col9 centeralign">yes</td>
I want to get:
Hostname Product + Region Date added
HostName01 EU 2007-01-01
HostName02 U.S 2008-09-29
Previously I tried stripping the HTML tags and using awk, but some of the cells in the table are empty, so I didn't get columns 1, 3 and 4 for all the rows.
I am trying to use:
xmllint --html --shell --format table.log <<< "cat //table/tr/th/td[1]/text()"
This gives me the second column. I tried "[0]", which doesn't work, and I'm not sure how to get multiple columns at once.
Upvotes: 4
Views: 3359
Reputation: 87984
You can do the following:
Use xmllint --xpath with an XPath expression that uses position()= to grab just columns 1, 3, and 4: //table/tr/*[position()=1 or position()=3 or position()=4]
Use perl -pe "s/<th class=\"col0/\n<th class=\"col0/g", etc., to strip out the markup and break it up into separate lines.
Use grep -v '^\s*$' to strip out blank lines.
Use column -t at the end to pretty-print it.
Like this:
xmllint --html \
--xpath "//table/tr/*[position()=1 or position()=3 or position()=4]" \
table.log \
| perl -pe "s/<th class=\"col0/\n<th class=\"col0/g" \
| perl -pe 's/<tr[^>]+>//' \
| perl -pe 's/<\/tr>//' \
| perl -pe 's/<t[dh][^>]*>//' \
| perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| perl -pe 's/<\/t[dh]>//' \
| grep -v '^\s*$' \
| column -t -s '|'
The above assumes the HTML document is in the file table.log (which seems like an odd name for an HTML file, but it appears to be the name that’s used in the question…). If the document is actually in some other *.html file, just put the actual filename instead.
That will give you output like this:
hostname Product + Region date added
HostName01 EU 2007-01-01
HostName02 U.S 2008-09-29
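As a side note on the indexing problem in the question: XPath positions are 1-based, so a [0] predicate never matches anything, and * matches both th and td cells. If you'd rather avoid the regex cleanup, a rough alternative is to ask xmllint for one cell at a time as a string. This is only a sketch, assuming a single table in the file and an xmllint that supports --xpath; extract_cols and cell are helper names made up for illustration:

```shell
# Hypothetical helper: text of column $3 in row $2 of file $1.
# normalize-space() returns a string, so an empty cell yields "" instead of an error.
cell() {
  xmllint --html --xpath "normalize-space(//table/tr[$2]/*[$3])" "$1" 2>/dev/null
}

# Hypothetical helper: print columns 1, 3 and 4 of every row, tab-separated.
extract_cols() {
  file=$1
  # count() returns a number, which xmllint prints on a line of its own
  rows=$(xmllint --html --xpath 'count(//table/tr)' "$file" 2>/dev/null)
  i=1
  while [ "$i" -le "${rows:-0}" ]; do
    c1=$(cell "$file" "$i" 1)
    c3=$(cell "$file" "$i" 3)
    c4=$(cell "$file" "$i" 4)
    printf '%s\t%s\t%s\n' "$c1" "$c3" "$c4"
    i=$((i + 1))
  done
}
```

Call it as extract_cols table.log. Rows with blank cells still produce three tab-separated fields, so the columns stay aligned. It starts xmllint once per cell, so expect it to be slow on large tables.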
Upvotes: 5
Reputation: 246744
Assuming your HTML is well-formed XML, xmlstarlet can do it:
xmlstarlet sel -t -m '//table/tr' -v '*[contains(@class,"col0")]' -o $'\t' \
-v '*[contains(@class,"col2")]' -o $'\t' \
-v '*[contains(@class,"col3")]' -n \
file.html
Output:
hostname Product + Region date added
HostName01 EU 2007-01-01
HostName02 U.S 2008-09-29
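If neither xmllint nor xmlstarlet is handy, the same idea of keying on the colN class rather than on position can be roughed out in plain awk. This is a fragile sketch, not an HTML parser: it assumes every cell sits on its own line with a class that starts with colN, as in the sample in the question, and cols_by_class is a made-up name:

```shell
# Hypothetical helper: print the col0, col2 and col3 cells of each row, tab-separated.
# Assumption: one <th>/<td> per line, each carrying class="colN ...".
cols_by_class() {
  awk '
    function emit() {
      if (seen) print cell[0] "\t" cell[2] "\t" cell[3]
      split("", cell)                    # clear the array (portable across awks)
      seen = 0
    }
    /<tr[ >]/ { emit() }                 # a new row starts: print the previous one
    match($0, /class="col[0-9]+/) {
      n = substr($0, RSTART + 10, RLENGTH - 10)   # digits after class="col
      text = $0
      gsub(/<[^>]*>/, "", text)                   # strip the tags
      gsub(/^[ \t]+|[ \t]+$/, "", text)           # trim surrounding blanks
      cell[n] = text                              # empty cells are stored as ""
      seen = 1
    }
    END { emit() }                       # last row, even if the closing tr tag is missing
  ' "$1"
}
```

Run it as cols_by_class table.log. Because each cell is stored under its class number, an empty cell earlier in the row does not shift the later columns, which is what went wrong with the strip-the-tags approach.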
Upvotes: 2