ITnewbie

Reputation: 520

Extracting values from <td> elements in an HTML file

I have the following string:

<td class="mytest" title="testfile" style="width:20%">0</td>

How do I get the value inside the td element using awk? In my case, it is 0.

I am very new to Linux, any help is appreciated!

Upvotes: 0

Views: 298

Answers (3)

Ed Morton

Reputation: 203219

If your input is always that regular, and you don't have and can't install XML-aware tools, then this works with any sed in any shell on every Unix box:

$ sed 's:<td.*>\(.*\)</td>:\1:' file
0

I'm using sed instead of awk because simple substitutions on individual lines like this are what sed is best suited for. With GNU awk you could do this with the 3rd arg to match():

$ awk 'match($0,"<td.*>(.*)</td>",a){print a[1]}' file
0

but with a POSIX awk it'd be a bit more cryptic (there are alternative approaches of course):

$ awk 'sub("</td>","") && sub("<td.*>","")' file
0

Think about what the above is doing and test it to make sure you don't get any false matches. It's always much easier to match what you want than it is to not match similar strings you don't want.
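To make the false-match warning concrete, here is a minimal sketch in Python (not sed or awk) that applies the same pattern to a line holding two cells. The `extract` helper and the `two_cells` input are hypothetical, made up for illustration. Python's regex engine is not the one sed or awk use, so treat this as a demonstration of how a greedy `.*` behaves and run the equivalent test against your own tool and data.

```python
import re

# The same pattern as in the commands above, in Python's regex syntax.
GREEDY = re.compile(r"<td.*>(.*)</td>")


def extract(line):
    """Return the captured cell text, or None if the line does not match."""
    m = GREEDY.search(line)
    return m.group(1) if m else None


one_cell = '<td class="mytest" title="testfile" style="width:20%">0</td>'
two_cells = "<td>1</td><td>2</td>"  # hypothetical input with two cells on one line

print(extract(one_cell))   # the single cell is captured as expected
print(extract(two_cells))  # the greedy .* swallows the first cell
```

With the two-cell line, the leading `.*` runs up to the last `>` that still lets the rest of the pattern match, so only the final cell is captured and the first one is silently lost.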

Upvotes: 1

Daweo

Reputation: 36370

If you are free to choose your tool, I would suggest hxselect (from html-xml-utils). If file.txt contains

<td class="mytest" title="testfile" style="width:20%">0</td>

it would be as simple as

cat file.txt | hxselect -i -c td

output

0

Explanation: -i matches case-insensitively, -c prints the content only, and td is the CSS selector. Disclaimer: there is no newline after the 0, as there is no newline inside the tag.

However, if you are limited to what is already installed, and the Linux machine you are using has Python installed (which, if I am not mistaken, recent Ubuntu versions do by default), you can use html.parser as follows. Create a file tdextract.py with the following content:

import sys
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.insidetd = False
        super().__init__()

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.insidetd = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.insidetd = False

    def handle_data(self, data):
        if self.insidetd:
            sys.stdout.write(data)

parser = MyHTMLParser()
parser.feed(sys.stdin.read())

then do

cat file.txt | python3 tdextract.py

which gives the same output as the hxselect command described earlier. Be warned that Python uses indentation to mark blocks, so it is crucial to keep the number of leading spaces.
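If the file has more than one td element, the script above writes all the values back to back with nothing between them. A possible variation, sketched here with the hypothetical names `TdCollector` and `td_values`, collects each cell into a list instead of printing it straight away:

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text of every td element into self.cells."""

    def __init__(self):
        super().__init__()
        self.inside_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.inside_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_td = False

    def handle_data(self, data):
        if self.inside_td:
            self.cells[-1] += data  # text may arrive in several pieces


def td_values(html):
    """Return a list with the text content of each td element in html."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


print(td_values('<td class="mytest" title="testfile" style="width:20%">0</td>'))
```

This is only a sketch: it assumes td elements are not nested inside each other, and it keeps whitespace inside a cell as it is.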

Upvotes: 2

Ted Lyngmo

Reputation: 117258

One option could be to use xmllint --html with an XPath expression to extract the value.

Example:

#!/bin/bash
data='<td class="mytest" title="testfile" style="width:20%">0</td>'
value=$(xmllint --html --xpath '//html/body/td/text()' - <<< "$data")
echo "$value"

Output:

0

Upvotes: 2
