Javier Isaaí

Reputation: 450

Extract text from HTML based on table column via Shell Script

I need to write a shell script that reads an HTML file, sample.html, and extracts data from one table column based on the value in another table column. For example, this is the HTML code:

<table style="BORDER-COLLAPSE: collapse"
  border="0" bordercolor="#000000"
  cellpadding="3" cellspacing="0" width="100%" height="200">
  <tr >
    <td class="fontStyleOne" width="30%">
      <div align="left">
      core6690.myserverdomain.com </div>
    </td>
    <td  class="tdfontTwo" width="30%">
      <div class="label-styler" align="left">
      admin</div>
    </td>
  </tr>
  <tr >
    <td class="fontStyleOne" width="30%">
      <div align="left">
      core6691.myserverdomain.com </div>
    </td>
    <td  class="tdfontTwo" width="30%">
      <div class="label-styler" align="left">
      secondary </div>
    </td>
  </tr>
  <tr >
    <td  class="fontStyleOne" width="30%">
      <div align="left">
      core6692.myserverdomain.com </div>
    </td>
    <td  class="tdfontTwo" width="30%">
      <div class="label-styler" align="left">
      primary </div>
    </td>
  </tr>
</table>

Let's say I want to determine the URL for "admin": the result would be core6690.myserverdomain.com. If my input is "primary", the output would be core6692.myserverdomain.com, and so on.

The HTML page has a lot more data, header tags, footer stuff, etc., but the important stuff that I am looking for is placed inside a table with the exact same structure I list in the code... except it has many more rows, not necessarily just 3 as in this example.

I have seen related answers on this site that use sed, grep, regular expressions, awk, and other tools; however, none of them are close enough to what I am looking for, and I do not have enough experience with any of those approaches to modify them to fit my needs.

Any suggestions? Thanks in advance.

Upvotes: 0

Views: 4693

Answers (2)

servn

Reputation: 36

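#!/bin/bash
# Usage: ./filter.sh ROLE
# Prints the server name found in the cell just before the one matching ROLE.

prev=
# Keep only the lines that close a <div>, strip the tags,
# and let the shell split what is left into words.
for i in $(grep '</div>' sample.html | sed 's/<[^>]*>//g'); do
    if [ "$i" = "$1" ]; then
        echo "$prev"
    fi
    prev=$i
done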

Example usage:

$ ./filter.sh primary
core6692.myserverdomain.com

P.S.: The format of sample.html must be exactly as you posted here: each server name and each role must be on the same line as its closing </div> tag and be preceded by whitespace or a tab.
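The same idea (remember the previous cell, print it when the current cell matches) can also be sketched with awk, which compares whole cell values instead of relying on shell word splitting. This is only a sketch under the same layout assumption as above; the function name lookup_server and its arguments are made up for illustration.

```shell
#!/bin/bash
# lookup_server FILE ROLE   (hypothetical helper, not part of the original script)
# Prints the server name from the cell that precedes the cell containing ROLE.
# Assumes, as in the posted excerpt, that every cell value sits on the same
# line as its closing </div> tag.
lookup_server() {
    awk -v role="$2" '
        /<\/div>/ {
            gsub(/<[^>]*>/, "")            # drop the tags
            gsub(/^[ \t]+|[ \t]+$/, "")    # trim surrounding whitespace
            if ($0 == role) print prev     # role cell: report the cell before it
            prev = $0
        }
    ' "$1"
}
```

After sourcing the file, it would be called as: lookup_server sample.html primary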

Upvotes: 1

BeniBela

Reputation: 16917

My Xidel can do that, if you are allowed to use other tools.

With xpath:

xidel /tmp/f.html -e "//tr[td[2] = 'admin']/td[1]"

or pattern matching:

xidel /tmp/f.html -e "<tr><td>{.}</td><td>admin</td></tr>"

At least that's how it is done for the excerpt you posted; for the larger file, it depends on what else is there.
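If the rest of the page gets in the way, one possible first step is to cut the file down to the table before extracting anything. The sketch below uses a plain sed line range; the function name table_rows is hypothetical, and it assumes the opening and closing table tags each start on their own line and that tables are not nested. If the page holds several tables, all of them are printed.

```shell
#!/bin/bash
# table_rows FILE   (hypothetical helper)
# Prints only the lines from an opening <table ...> tag through the next
# line containing </table>, so surrounding headers and footers are dropped.
table_rows() {
    sed -n '/<table/,/<\/table>/p' "$1"
}
```

The output can be saved to a smaller file (for example, table_rows sample.html > table.html), and that file can then be passed to whichever extraction command is used.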

Upvotes: 4
