loyd

Reputation: 147

Extracting string from html text

I am fetching HTML with curl and need to extract only the second table element. Note that the fetched HTML is a single string and is not formatted. For a better explanation, see the following (... stands for more HTML):

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

I have tried several sed commands so far, but I suspect that matching the second table like this is not the cleanest approach:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'

Upvotes: 0

Views: 66

Answers (2)

curusarn

Reputation: 403

Save the script below as script.py and run it like this:

python3 script.py input.html

This script parses the HTML and checks for the attributes (width and cellpadding). The advantage of this approach is that if you change the formatting of the HTML file it will still work because the script parses the HTML instead of relying on exact string matching.

from html.parser import HTMLParser
import sys

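def print_tag(tag, attrs, end=False):
    # Rebuild the tag text from the parsed name and attributes.
    line = "<"
    if end:
        line += "/"
    line += tag
    for attr, value in attrs:
        if value is None:
            # Valueless attribute such as <td nowrap>
            line += " " + attr
        else:
            line += " " + attr + '="' + value + '"'
    print(line + ">", end="")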

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='cp1252') as content_file:
    content = content_file.read()

do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                do_print = True
        if do_print:
            print_tag(tag, attrs)

    def handle_endtag(self, tag):
        global do_print
        if do_print:
            print_tag(tag, attrs=(), end=True)
            if tag == "table":
                do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data, end="")

parser = MyHTMLParser()
parser.feed(content)
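
Note that the script above stops printing at the first </table> it sees, so a table nested inside the wanted one would cut the output short. Below is a minimal sketch of a variant that counts nesting depth and returns the table as a string instead of printing it. The names TableExtractor and extract_table are made up for this example, and it assumes the wanted table can be identified by exact attribute values:

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the first <table> whose attributes match `wanted`."""

    def __init__(self, wanted):
        # convert_charrefs=False keeps entities such as &amp; verbatim.
        super().__init__(convert_charrefs=False)
        self.wanted = wanted   # dict of required attribute values
        self.depth = 0         # nesting depth inside the matched table
        self.done = False      # set once the matched table is closed
        self.parts = []        # collected pieces of the table's HTML

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            if self.depth:
                self.depth += 1  # nested table inside the wanted one
            elif all(dict(attrs).get(k) == v for k, v in self.wanted.items()):
                self.depth = 1   # the wanted table starts here
        if self.depth:
            # Original source text of the start tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as they appear.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append("</" + tag + ">")
            if tag == "table":
                self.depth -= 1
                if self.depth == 0:
                    self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#" + name + ";")


def extract_table(html, wanted):
    """Return the first matching table as a string ('' if none matches)."""
    parser = TableExtractor(wanted)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

It could be called as extract_table(content, {"width": "100%", "cellpadding": "4"}), with content read from the file as in the script above.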

Upvotes: 1

Jotne

Reputation: 41446

An HTML parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this
  • /<table width="100%" cellpadding="4">/ {f=1} when the start tag is found, set flag f to true.
  • f; if flag f is true, do the default action, which is to print the line.
  • /<\/table>/ {f=0} when the end tag is found, clear flag f to stop printing.

This could also be used, but I like the flag control better:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
<table width="100%" cellpadding="4">
...
</table> #to this
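
Both awk commands work line by line, so they only cut out the table when its tags sit on separate lines. If the curl output really is one long line, as the question says, awk would print that whole line. Here is a minimal sketch in Python (not awk) that does not depend on line breaks. The name extract_table_regex is made up for this example; it assumes the attributes appear in exactly this order, and it stops at the first </table>, so a nested table would cut it short:

```python
import re

# Non-greedy match from the wanted opening tag to the first closing tag.
# DOTALL lets "." cross newlines, so multi-line input works as well.
TABLE_RE = re.compile(
    r'<table\s+width="100%"\s+cellpadding="4"\s*>.*?</table\s*>',
    re.DOTALL | re.IGNORECASE,
)


def extract_table_regex(html):
    """Return the matched table as a string, or None if it is not found."""
    match = TABLE_RE.search(html)
    return match.group(0) if match else None
```

For example, the output of curl could be read from standard input with sys.stdin.read() and passed to extract_table_regex.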

Upvotes: 2
