Lenny

Reputation: 917

Parse HTML using shell

I have an HTML file with lots of data, and the part I am interested in is:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I am trying to use awk, which currently looks like this:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want to get is:

54
1
0
0

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

Upvotes: 22

Views: 55668

Answers (11)

baptx

Reputation: 3916

In the past I used PhantomJS, but nowadays you can do it with similar tools that are still maintained, like Selenium WebDriver.

It makes it possible to use DOM API functions with JavaScript in a headless browser like Firefox or Chromium, and you can call the script (written in Node.js or Python, for example) from your shell script if you want to do additional processing in the shell.
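
A minimal sketch of that idea, assuming Python 3 with Selenium 4 and geckodriver on PATH; the file path is hypothetical, and the selector matches the question's cells:

# Query the DOM from a headless Firefox and print the results to the shell.
python3 - <<'PY'
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("file:///path/to/index.html")  # hypothetical path
    for cell in driver.find_elements(By.CSS_SELECTOR, 'td[align="right"]'):
        print(cell.text.split()[0])           # "0 (0/0)" -> "0"
finally:
    driver.quit()
PY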

Upvotes: 0

Jean

Reputation: 11

What about:

lynx -dump index.html
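
lynx renders the page to plain text, so the numbers can then be picked out with the usual text tools. A sketch, assuming the rendered table row contains the word Total (-nolist suppresses lynx's list of links):

lynx -dump -nolist index.html |
awk '/Total/ { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i }'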

Upvotes: 1

Reino

Reputation: 3423

With xidel, a true HTML parser, and XPath:

$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0

$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0

Upvotes: 2

greyfade

Reputation: 25677

I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving of invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0

Upvotes: 2

kenorb

Reputation: 166879

BSD/GNU grep/ripgrep

For simple extracting, you can use grep, for example:

  • Your example using grep:

    $ egrep -o "[0-9][^<]*" file.html
    54
    1
    0 (0/0)
    0
    

    and using ripgrep:

    $ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
    54
    1
    0 (0/0)
    0
    
  • Extracting the outer HTML of the h1 element:

    $ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    

Other examples:

  • Extracting the body:

    $ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' '.

  • For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is much faster, since it's written in Rust.

Upvotes: 4

kenorb

Reputation: 166879

ex/vim

For more advanced parsing, you may use the in-place editors ex/vi, where you can jump between matching HTML tags, select/delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

  • Use the ex in-place editor to substitute on all lines (%) via ex +"%s/pattern/replace/g".

    The substitution pattern consists of 3 parts:

    • Select everything from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
    • Capture the main part, up to the next < (\([^<]\+\)).
    • Select everything else, from < onward (<.*), for removal.

    The whole matching line is then replaced by \1, which refers to the group captured between \( and \).

  • After the substitution, remove every line that still contains letters using the global command: g/[a-zA-Z]/d.

  • Finally, print the current buffer on the screen with +%p.
  • Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").

Once tested, to replace the file in place, change -scq! to -scwq.


Here is another simple example, which removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, it's not advisable to parse HTML with regex, so for a long-term approach you should use an appropriate language (such as Python, Perl, or PHP with a DOM parser).

Upvotes: 1

kenorb

Reputation: 166879

HTML-XML-utils

You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes a number of binary tools to extract or modify the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

Here is the example with provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

Here is the final example with stripping out <b> tags:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
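
If the input is not perfectly well-formed, hxnormalize from the same package can usually clean it up first; a sketch (-x makes it emit well-formed XML):

$ curl -s http://example.com/ | hxnormalize -x | hxselect -c title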

For more examples, check the html-xml-utils documentation.

Upvotes: 7

clt60

Reputation: 63974

You really should use a real HTML parser for this job, for example Perl with Mojolicious:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

54
1
0
0

But for this you need to have Perl and the Mojolicious package installed.

It is easy to install with:

curl -L get.mojolicio.us | sh

Upvotes: 4

Ed Morton

Reputation: 204558

$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
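
Spelled out: the field separator matches either an opening <td ...> (optionally followed by <b>), or an optional </b> followed by </td>, so the cell text ends up in $2. The $2~/[0-9]/ test skips the "Total" row, and $2+0 coerces the text to a number, turning "0 (0/0)" into 0. The same command with that made explicit:

awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '
$2 ~ /[0-9]/ {     # keep only cells whose text contains a digit
    print $2 + 0   # numeric coercion: "0 (0/0)" -> 0
}' file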

Upvotes: 4

konsolebox

Reputation: 75588

awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 }' file

Output:

54
1
0
0
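
To see why the value is in $3: with < and > as field separators, the text after a cell's first > always lands in the third field. A quick check:

$ echo '<td align=right>0 (0/0)</td>' | awk -F '[<>]' '{ print $3 }'
0 (0/0)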

Another:

awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {                 # locate the "Total" header row
    while (getline > 0 && /<td /) {         # read the following <td ...> lines
        gsub(/<b>/, ""); sub(/ .*/, "", $3) # strip <b>; trim "0 (0/0)" to "0"
        print $3                            # the cell text is in the 3rd field
    }
    exit
}' file

Upvotes: 14

hek2mgl

Reputation: 158230

awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool which can execute XPath queries, and xsltproc performs XSL transformations. xmllint ships with libxml2 (the libxml2-utils package on Debian-based systems), while xsltproc comes with libxslt.
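
For example, a minimal sketch with xmllint (--html enables the lenient HTML parser, so the input doesn't have to be valid XML; the exact node formatting of the output varies between libxml2 versions):

$ xmllint --html --xpath '//td[@align="right"]' index.html

Or pull the whole row out as one normalized string to split in the shell, which should print something like "Total 54 1 0 (0/0) 0":

$ xmllint --html --xpath 'normalize-space(//tr[td="Total"])' index.html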

You can also use a programming language that is able to parse HTML.

Upvotes: 30
