Reputation: 917
I have an HTML file with lots of data; the part I am interested in is:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I am trying to use awk, which currently is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"
but what I want is to have:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?
Upvotes: 22
Views: 55668
Reputation: 3916
In the past I used PhantomJS, but now you can do the same with similar tools that are still maintained, such as Selenium WebDriver.
It lets you use DOM API functions with JavaScript in a headless browser such as Firefox or Chromium, and you can call the script (written in Node.js or Python, for example) from your shell script if you want to do additional processing in the shell.
Upvotes: 0
Reputation: 3423
With xidel, a true HTML parser, and XPath:
$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0
$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0
Upvotes: 2
Reputation: 25677
I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0
Upvotes: 2
Reputation: 166879
grep/ripgrep
For simple extraction, you can use grep.
Your example using grep:
$ grep -o '[0-9][^<]*' file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail +2
54
1
0 (0/0)
0
Extracting outer html of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is much faster since it's written in Rust.
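As a self-contained check, the grep approach can be run against the question's rows via a here-document (the file name page.html is just for illustration):

```shell
# Recreate the question's rows in a scratch file (hypothetical name).
cat > page.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
# Print every run that starts with a digit, up to the next "<".
grep -o '[0-9][^<]*' page.html
```

which prints 54, 1, 0 (0/0) and 0, one per line, matching the example above.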
Upvotes: 4
Reputation: 166879
ex/vim
For more advanced parsing, you may use in-place editors such as ex or vi, which let you jump between matching HTML tags, select/delete inner/outer tags, and edit the content in place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
- Use the ex in-place editor to substitute on all lines (%): ex +"%s/pattern/replace/g".
- The substitution pattern consists of 3 parts:
  - The part before the content (^[^>].*>), which is removed, ending right before the 2nd part.
  - The content itself, which we keep (\([^<]\+\)): everything up to the next <.
  - The trailing part (<.*), which is removed.
  - In the replacement, \1 refers to the pattern captured inside the brackets (\(...\)).
- After the substitution, remove any remaining alphabetic lines using a global command: g/[a-zA-Z]/d.
- Print the buffer (+%p).
- Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
- When tested, to replace the file in-place, change -scq! to -scwq.
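The same two steps (the substitution, then dropping alphabetic lines) can also be sketched with sed and grep instead of ex; this is my own rough equivalent, not part of the original answer:

```shell
# Recreate the question's rows (hypothetical file name).
cat > page.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
# Keep the text between the last ">" and the following "<" on each line,
# then delete lines that still contain letters (mirrors g/[a-zA-Z]/d).
sed -n 's/^[^>].*>\([^<][^<]*\)<.*/\1/p' page.html | grep -v '[a-zA-Z]'
```

which prints 54, 1, 0 (0/0) and 0, the same result the ex command produces.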
Here is another simple example which removes style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, it's not advisable to parse HTML with regex; for a long-term approach you should use an appropriate language (such as Python, Perl, or the PHP DOM).
Upvotes: 1
Reputation: 166879
HTML-XML-utils
You may use html-xml-utils for parsing well-formatted HTML/XML files. The package includes a number of command-line tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example with stripping out <b>
tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check the html-xml-utils.
Upvotes: 7
Reputation: 63974
You really should use a real HTML parser for this job, such as:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
But for this you need perl with the Mojolicious package installed
(which is easy to install with):
curl -L get.mojolicio.us | sh
Upvotes: 4
Reputation: 204558
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
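For reference, the one-liner can be verified against the question's snippet with a here-document (the file name page.html is illustrative; assumes a POSIX awk such as gawk):

```shell
# Recreate the question's rows (hypothetical file name).
cat > page.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
# The FS regex eats the opening <td ...> (plus optional <b>) and the
# closing (optional </b>)</td>, so $2 holds the cell text; +0 keeps
# only the leading number.
awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' page.html
```

which prints 54, 1, 0 and 0, one per line.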
Upvotes: 4
Reputation: 75588
awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
while (getline > 0 && /<td /) {
gsub(/<b>/, ""); sub(/ .*/, "", $3)
print $3
}
exit
}' file
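Since the getline variant's output isn't shown above, here is a self-contained run against the question's rows (the file name page.html is illustrative):

```shell
# Recreate the question's rows (hypothetical file name).
cat > page.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
# After the "Total" row is found, getline walks the following <td  lines;
# gsub/sub trim the markup so $3 is just the leading number.
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' page.html
```

which prints 54, 1, 0 and 0, one per line.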
Upvotes: 14
Reputation: 158230
awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool which is able to execute XPath queries, and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.
You can also use a programming language which is able to parse HTML.
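A minimal sketch of the xmllint route, assuming libxml2-utils is installed (the file name and the surrounding <table> wrapper are my additions, so the lenient HTML parser keeps the rows):

```shell
# Recreate the question's rows inside a table (hypothetical file name).
cat > page.html <<'EOF'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
# --html enables the forgiving HTML parser; --xpath evaluates the query.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --html --xpath '//td[@align="right"]' page.html 2>/dev/null
fi
```

The exact serialization of the matched nodes varies between libxml2 versions, so you may still want to strip the remaining tags afterwards (e.g. with sed, as in the html-xml-utils answer).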
Upvotes: 30