Reputation: 619
I need to get the HTML contents between a given pair of tags using a bash script. For example, given the HTML below:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
Given the body tag, the bash command/script would output:
text
<div>
text2
<div>
text3
</div>
</div>
Thanks in advance.
Upvotes: 19
Views: 57780
Reputation: 1225
I just discovered a really nice Unix command line tool for this, hq.
I use Arch Linux, where installing it was simply pacman -S hq.
My problem was extracting JSON-LD from HTML headers, and with hq you just run:
curl -sSL https://www.example.com | hq '[type="application/ld+json"]' text
Upvotes: 1
Reputation: 437197
Another option is to use the multi-platform xidel
utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:
xidel -s in.html -e '/html/body/node()' --printed-node-format=html
The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text
node.
If you want the output as plain text (no coloring), Reino points out that you can simplify to:
xidel -s in.html -e '/html/body/inner-html()'
Upvotes: 5
Reputation: 1323
Consider using beautifulspoon.
Select the body tag from the above .html:
$ beautifulspoon example.html --select body
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
And to unwrap the tag:
$ beautifulspoon example.html --select body | beautifulspoon --select body --unwrap
text
<div>
text2
<div>
text3
</div>
</div>
Upvotes: 0
Reputation: 2042
Personally, I find it very useful to use the hxselect command (often with the help of hxclean) from the html-xml-utils package. hxclean fixes a (sometimes broken) HTML file into a correct XML file, and hxselect lets you use CSS selectors to get the node(s) you need. With the -c option, it strips the surrounding tags. All these commands work on stdin and stdout. So in your case you would execute:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need. Plain and simple.
Upvotes: 11
Reputation: 45223
Using sed in shell/Bash, so you needn't install anything else:
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
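Note that this prints the <body> and </body> lines themselves. If you only want the contents between them, and the tags sit alone on their own lines as in the question's example, one small extension is to drop the first and last line of the match:

```shell
tag=body
# print the <tag>..</tag> range, then delete its first and last lines (the tags)
sed -n "/<$tag>/,/<\/$tag>/p" <<'HTML' | sed '1d;$d'
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
```

Both sed commands are line-oriented and match the literal <body>, so a tag with attributes (e.g. <body class="x">) will not be found.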
Upvotes: 16
Reputation: 22296
Forgetting Bash due to its limitations, you can use Nokogiri as a command-line utility, as explained here.
Example:
curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'
Upvotes: 7
Reputation: 328556
Bash is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead.
It will be more work upfront, but in the long run (here: after one hour) the time savings will make up for the additional effort.
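As a sketch of the idea using only the Python standard library (Beautiful Soup is a third-party install; with it, the whole extraction collapses to roughly BeautifulSoup(html, "html.parser").body.decode_contents()):

```shell
# Stdlib-only sketch; Beautiful Soup makes this far shorter.
python3 - <<'PY'
from html.parser import HTMLParser

HTML = """<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>"""

class BodyExtractor(HTMLParser):
    """Collect everything between <body> and </body>, inner tags included."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside <body>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.depth += 1
            if self.depth == 1:
                return      # skip the <body> tag itself
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.depth -= 1
            if self.depth == 0:
                return      # skip the closing </body> tag
        if self.depth:
            self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

p = BodyExtractor()
p.feed(HTML)
print("".join(p.chunks).strip())
PY
```

The length of this compared to the Beautiful Soup one-liner is exactly this answer's point.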
Upvotes: -1
Reputation: 195039
Plain-text processing is not a good fit for HTML/XML parsing. I hope this gives you some ideas:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
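The <body> tags themselves are included above. If you want only the contents between them, as asked, selecting the element's child nodes instead should work (same xmllint, just a different XPath; assuming a reasonably recent libxml2):

```shell
# node() selects the children of <body>, so the tags themselves are excluded
xmllint --xpath "//body/node()" - <<'HTML'
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
```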
Upvotes: 17