Joao

Reputation: 619

Get content between a pair of HTML tags using Bash

I need to get the HTML content between a given pair of tags using a bash script. For example, given the HTML code below:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>

Given the body tag, the command/script should output:

 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>

Thanks in advance.

Upvotes: 19

Views: 57780

Answers (8)

joeblog

Reputation: 1225

I just discovered a really nice Unix command-line tool for this: hq.

I use Arch Linux, where installation was as simple as pacman -S hq.

My problem was extracting JSON-LD from HTML headers; with hq you can simply run:

curl -sSL https://www.example.com | hq '[type="application/ld+json"]' text

Upvotes: 1

mklement0

Reputation: 437197

Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:

xidel -s in.html -e '/html/body/node()' --printed-node-format=html

The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.

If you want the text only, Reino points out that you can simplify to:

xidel -s in.html -e '/html/body/inner-html()'

Upvotes: 5

socrates

Reputation: 1323

Consider using beautifulspoon.

Select the body tag from the above .html:

$ beautifulspoon example.html --select body
<body>
 text
 <div>
  text2
  <div>
   text3
  </div>
 </div>
</body>

And to unwrap the tag:

$ beautifulspoon example.html --select body | beautifulspoon --select body --unwrap
text
<div>
 text2
 <div>
  text3
 </div>
</div>

Upvotes: 0

Cromax

Reputation: 2042

Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. The latter fixes (sometimes broken) HTML into correct XML, and the former lets you use CSS selectors to get the node(s) you need. With the -c option, it strips the surrounding tags. All these commands work on stdin and stdout. So in your case you would execute:

$ hxselect -c body <<HTML
  <html>
  <head>
  </head>
  <body>
    text
    <div>
      text2
      <div>
        text3
      </div>
    </div>
  </body>
  </html>
HTML

to get what you need. Plain and simple.

Upvotes: 11

BMW

Reputation: 45223

Using sed in shell/bash, so you don't need to install anything else:

tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
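A runnable sketch of this approach on the question's sample (the file name sample.html is our choice). Note two caveats: the range match also prints the <body> and </body> lines themselves, and it only works when each tag sits on its own line without attributes. A second sed '1d;$d' drops the tag lines:

```shell
# Recreate a trimmed version of the question's sample HTML
cat > sample.html <<'EOF'
<html>
<head>
</head>
<body>
 text
  <div>
  text2
  </div>
</body>
</html>
EOF

tag=body
# Print the matched range, then drop its first and last lines
# (the <body> and </body> tags themselves).
sed -n "/<$tag>/,/<\/$tag>/p" sample.html | sed '1d;$d'
```

This keeps the tool-free spirit of the answer while matching the output format the question asks for.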

Upvotes: 16

Paulo Fidalgo

Reputation: 22296

Setting Bash aside due to its limitations, you can use nokogiri as a command-line utility, as explained here.

Example:

curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

Upvotes: 7

Aaron Digulla

Reputation: 328556

Bash is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead.

It will be more work upfront, but in the long run (here: after about an hour), the time savings will make up for the additional effort.
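For comparison, here is a dependency-free sketch of the same idea using Python's standard-library html.parser rather than Beautiful Soup itself (the class name BodyExtractor is ours):

```shell
python3 - <<'EOF'
from html.parser import HTMLParser

class BodyExtractor(HTMLParser):
    """Collect everything between <body> and </body>."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside <body>
        self.parts = []  # collected raw content

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.depth += 1
        elif self.depth:
            # Reproduce the tag exactly as it appeared in the source
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'body':
            self.depth -= 1
        elif self.depth:
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

p = BodyExtractor()
p.feed('<html><body> text<div>text2</div></body></html>')
print(''.join(p.parts))  # -> " text<div>text2</div>"
EOF
```

Beautiful Soup offers a far richer API (CSS selectors, tolerant parsing of broken markup), but the stdlib version shows the shape of the script-language route without a third-party install.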

Upvotes: -1

Kent

Reputation: 195039

Plain-text processing is not a good fit for HTML/XML parsing. I hope this gives you some ideas:

kent$  xmllint --xpath "//body" f.html 
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
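If the input is real-world HTML rather than well-formed XML, xmllint's --html option (assuming a libxml2 build with HTML support) switches to the lenient HTML parser, which tolerates unclosed tags and missing structure:

```shell
# A deliberately minimal page (file name page.html is our choice)
cat > page.html <<'EOF'
<html><body> text<div>text2</div></body></html>
EOF

# Parse with the HTML parser instead of the strict XML parser
xmllint --html --xpath "//body" page.html
```

Without --html, xmllint rejects many pages that browsers accept, so it is worth reaching for whenever the source is not guaranteed to be XHTML.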

Upvotes: 17
