Reputation: 71
I have a problem. I want to extract two parts of this HTML into variables using the sed or grep command. How can I extract both of them?
test.html:
<html>
<body>
<div id="foo" class="foo">
Some Text.
<p id="author" class="author">
<br>
<a href="example.com">bar</a>
</p>
</div>
</body>
</html>
script.sh:
#!/bin/bash
author=$(sed 's/.*<p id="author" class="author"><br><a href="*">\(.*\)<\/a><\/p>.*/\1/p' test.html)
quote=$(sed 's/.*<div id="foo" class="foo">\(.*\)<\/div>.*/\1/p' test.html)
In the end I want only the text in the variables, without the HTML tags. But my script doesn't work.
Upvotes: 5
Views: 8545
Reputation: 3423
Please don't use regex to parse HTML/XML, but use a dedicated parser like xidel instead:
$ xidel -s test.html -e '//p/a,//div/normalize-space(text())'
bar
Some Text.
$ eval $(xidel test.html -se 'author:=//p/a,quote:=//div/normalize-space(text())' --output-format=bash)
$ printf '%s\n' "$author" "$quote"
bar
Some Text.
Upvotes: 0
Reputation: 43
You can use html2text
# cat test.html | html2text
Some Text.
[bar](example.com)
I often use it with curl:
# curl -s http://www.example.com/ | html2text
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
#
Upvotes: 2
Reputation: 186
You can use xmllint to parse HTML/XML and extract the values matching a given XPath expression.
Here is a working example:
#!/bin/bash
# string() yields the text of the first matching <a>; normalize-space() trims the newlines around the div's text.
author=$(xmllint --html --xpath 'string(//p[@class="author"]/a)' test.html)
quote=$(xmllint --html --xpath 'normalize-space(//div[@class="foo"]/text())' test.html)
echo "Author:'$author'"
echo "Quote:'$quote'"
Upvotes: 1
Reputation: 5903
text="$(sed 's:^ *::g' < test.html | tr -d \\n)"
author=$(sed 's:.*<p id="author" class="author"><br><a href="[^"]*">\([^<]*\)<.*:\1:' <<<"$text")
quote=$(sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:' <<<"$text")
echo "'$author' '$quote'"
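To try these commands without touching your own files, here is a minimal self-contained sketch. It assumes mktemp is available and writes a copy of the question's test.html into a temporary directory; the tmpdir variable and the cleanup trap are illustrative additions. It also pipes printf into sed in place of the <<< here-string, so that it runs under a plain POSIX sh as well as bash.

```shell
#!/bin/sh
# Sketch only: reproduces the question's test.html in a throwaway directory.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

cat > "$tmpdir/test.html" <<'EOF'
<html>
<body>
<div id="foo" class="foo">
Some Text.
<p id="author" class="author">
<br>
<a href="example.com">bar</a>
</p>
</div>
</body>
</html>
EOF

# Strip leading spaces from every line, then join all lines into one.
text=$(sed 's:^ *::' < "$tmpdir/test.html" | tr -d '\n')

# Capture everything between the opening tag(s) and the next "<".
# ":" is the sed delimiter, so the "/" in the HTML needs no escaping.
author=$(printf '%s\n' "$text" | sed 's:.*<p id="author" class="author"><br><a href="[^"]*">\([^<]*\)<.*:\1:')
quote=$(printf '%s\n' "$text" | sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:')

printf 'author=%s\nquote=%s\n' "$author" "$quote"
```

With the sample file this should print author=bar and quote=Some Text. Because the patterns match the exact tag and attribute text, any change in the markup (attribute order, extra whitespace inside a tag) will make them stop matching.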
$text is assigned an unindented single-line representation of test.html. Note that : is used as the sed delimiter instead of /, since any character can serve as a delimiter and the text we are parsing contains / characters, so we don't have to escape them with \ when constructing the regex.
$author is assumed to be between <p id="author" class="author"><br><a href="[^"]*"> (where [^"]* means «any characters except ", repeated N times, N ∈ [0, +∞)») and any tag that comes next.
$quote is assumed to be between <div id="foo" class="foo"> and any tag that comes next.
<<<"$text" is the so-called here-string, which is almost equivalent to echo "$text" | placed at the beginning.
Upvotes: 5