noooob

Reputation: 53

Parse HTML table columns with bash

I'm trying to extract 3 columns from an HTML table. I need hostname, product + region, and date added, which are columns 1, 3, and 4.

<div class="table sectionedit2">
  <table class="inline">
    <tr class="row0">
      <th class="col0 centeralign">hostname</th>
      <th class="col1 centeralign">AKA (Client hostname)</th>
      <th class="col2 leftalign">Product + Region</th>
      <th class="col3 centeralign">date added</th>
      <th class="col4 centeralign">  decom. date  </th>
      <th class="col5 centeralign">           builder           </th>
      <th class="col6 centeralign">  build cross-checker  </th>
      <th class="col7 leftalign"> <strong>decommissioner</strong></th>
      <th class="col8 centeralign">customer managed filesystems</th>
      <th class="col9 centeralign">  only company has root?  </th>
    </tr>
    <tr class="row1">
      <th class="col0 centeralign">HostName01</th>
      <td class="col1 leftalign">Host01</td>
      <td class="col2 leftalign">EU</td>
      <td class="col3 centeralign">2007-01-01</td>
      <td class="col4 leftalign"></td>
      <td class="col5 centeralign">Me</td>
      <td class="col6 centeralign">You</td>
      <td class="col7 leftalign">Builder01</td>
      <td class="col8 leftalign">xChecker01</td>
      <td class="col9 centeralign">yes</td>
    </tr>
   <tr class="row2">
     <th class="col0 centeralign">HostName02</th>
     <td class="col1 leftalign">Host02</td>
     <td class="col2 leftalign">U.S</td>
     <td class="col3 centeralign">2008-09-29</td>
     <td class="col4 leftalign"></td>
     <td class="col5 leftalign">Me01</td>
     <td class="col6 leftalign">You01</td>
     <td class="col7 leftalign">Builder02</td>
     <td class="col8 leftalign">xChecker02</td>
     <td class="col9 centeralign">yes</td>

I want to get:

Hostname     Product + Region   Date added

HostName01   EU                 2007-01-01

HostName02   U.S                2008-09-29

Previously I tried stripping the HTML tags and using awk, but some of the cells in the table are empty, so I didn't get columns 1, 3 and 4 for all the rows.

I am trying to use:

xmllint --html --shell --format table.log <<< "cat //table/tr/th/td[1]/text()"

This gives me the second column. I tried "[0]", which doesn't work, and I'm not sure how to get multiple columns at once.

Upvotes: 4

Views: 3359

Answers (2)

sideshowbarker

Reputation: 87984

You can do the following:

  • run xmllint --xpath with an XPath expression that uses position()= to grab just columns 1, 3, and 4: //table/tr/*[position()=1 or position()=3 or position()=4]
  • pipe through perl -pe "s/<th class=\"col0/\n<th class=\"col0/g", etc., to strip out the markup and break it up into separate lines
  • pipe through grep -v '^\s*$' to strip out blank lines
  • pipe through column -t at the end to pretty-print it

Like this:

xmllint --html \
  --xpath "//table/tr/*[position()=1 or position()=3 or position()=4]" \
    table.log \
    | perl -pe "s/<th class=\"col0/\n<th class=\"col0/g" \
    | perl -pe 's/<tr[^>]+>//' \
    | perl -pe 's/<\/tr>//' \
    | perl -pe 's/<t[dh][^>]*>//' \
    | perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
    | perl -pe 's/<\/t[dh]>//' \
    | grep -v '^\s*$' \
    | column -t -s '|'

The above assumes the HTML document is in the file table.log (which seems like an odd name for an HTML file, but it appears to be the name that’s used in the question…). If the document is actually in some other *.html file, substitute the actual filename.

That will give you output like this:

hostname    Product + Region  date added
HostName01  EU                2007-01-01
HostName02  U.S               2008-09-29
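
If xmllint isn't at hand, here is a rough awk-only sketch of the same idea. It is a different technique from the pipeline above, not a replacement for a real HTML parser: it counts cells per row, so empty cells still occupy their position. It assumes every <th>/<td> cell opens and closes on a single line, as in the question's sample. The function name extract_cols and the column widths 12 and 18 are made up for illustration.

```shell
# Hypothetical helper: print columns 1, 3 and 4 of each table row.
# Assumption: each <th>/<td> cell opens and closes on one line, and each
# <tr> tag sits on its own line, as in the sample from the question.
extract_cols() {
  awk '
    # Print the collected row (if any) and reset the cell counter.
    function emit_row() {
      if (n > 0) printf "%-12s %-18s %s\n", cell[1], cell[3], cell[4]
      n = 0
      split("", cell)          # portable way to clear the array
    }
    # A new <tr> starts: flush whatever the previous row collected.
    /<tr[ >]/ { emit_row(); next }
    # A cell line: strip all tags, trim whitespace, store by position.
    # Empty cells are stored as "" so later columns do not shift.
    /<t[dh][ >]/ {
      gsub(/<[^>]*>/, "")
      gsub(/^[ \t\r]+|[ \t\r]+$/, "")
      cell[++n] = $0
    }
    # The last row may have no closing </tr>, so flush at end of input too.
    END { emit_row() }
  ' "$@"
}
```

Called as extract_cols table.log (or with the HTML on stdin), this should print one line per row with the three columns padded to fixed widths. It will break as soon as a cell spans several lines or several cells share a line, which is where the XPath approach is the safer choice.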

Upvotes: 5

glenn jackman

Reputation: 246744

Assuming your HTML is well-formed XML, you can do it with xmlstarlet:

xmlstarlet sel -t -m '//table/tr' -v '*[contains(@class,"col0")]' -o $'\t' \
                                  -v '*[contains(@class,"col2")]' -o $'\t' \
                                  -v '*[contains(@class,"col3")]' -n       \
    file.html

Output:

hostname    Product + Region    date added
HostName01  EU  2007-01-01
HostName02  U.S 2008-09-29

Upvotes: 2
