Reputation: 11700

Find specific tags in an HTML file

I have some HTML files and want to extract the contents between certain tags: The title of the page some tagged content here.

<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright &copy; 2012 </p>

I just want these tags: head and p. But as can be seen in the second line, the last tag, <p class="en-cpy">, starts with p but is not one of my desired tags, and I don't want its content. I used the following script to extract my desired text, but I can't filter out tags such as the last one in my example. How can I extract just the <p> tags?

grep "<p>" "$File" | sed -e 's/^[ \t]*//'

I should add that the last tag (which I don't want to appear in the output) comes right after one of my desired tags (as in my example), and because grep prints whole matching lines, all the content of that line is returned as output. This is my problem.

Upvotes: 0

Answers (3)

Jim Deville

Reputation: 10662

Don't. Trying to use regex to parse HTML is going to be painful. Use something like Ruby and Nokogiri, or a similar language + library that you are familiar with.
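As an illustration of the parser-based approach, here is a minimal sketch using Python's standard-library html.parser instead of Ruby and Nokogiri. It is not the answerer's code. The names PlainTagExtractor and extract are made up for this example, and it assumes that "desired" means head elements plus p elements that carry no attributes, and that every p has an explicit closing tag.

```python
from html.parser import HTMLParser


class PlainTagExtractor(HTMLParser):
    """Collect text of wanted elements that have no attributes (hypothetical helper)."""

    def __init__(self, wanted=("head", "p")):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.results = []   # (tag, text) pairs, in document order of closing tags
        self._stack = []    # one [tag, keep, parts] entry per open wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            # Assumption: a tag with any attribute (e.g. class="en-cpy") is unwanted.
            self._stack.append([tag, not attrs, []])

    def handle_data(self, data):
        # Text outside any wanted element (e.g. inside a lone <span>) is ignored.
        if self._stack:
            self._stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if tag in self.wanted and self._stack and self._stack[-1][0] == tag:
            _, keep, parts = self._stack.pop()
            if keep:
                self.results.append((tag, "".join(parts).strip()))


def extract(html, wanted=("head", "p")):
    """Return (tag, text) pairs for the wanted, attribute-free elements."""
    parser = PlainTagExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


# The example lines from the question.
sample = (
    "<p>A paragraph comes here</p>\n"
    '<p>A paragraph comes here</p><span class="more-about">Some text here</span>'
    '<p class="en-cpy">Copyright &copy; 2012 </p>'
)

for tag, text in extract(sample):
    print(tag, text)
```

On the question's sample this prints the two plain paragraphs and skips both the span and the copyright paragraph, even though they share a line with a wanted tag.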

Upvotes: 3

ormaaj

Reputation: 6577

xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"

If you're dealing with broken HTML you might need a different parser. Here's a "one-liner" that does basically the same thing using lxml. Just pass the script your URL.

#!/usr/bin/env python3
from lxml import etree
import sys

# sys.argv[1] is the URL or file path given on the command line (sys.argv[0] is the script itself)
print('\n'.join(etree.tostring(x, encoding="utf-8", with_tail=False).decode("utf-8") for x in (lambda i: etree.parse(i, etree.HTMLParser(remove_blank_text=1, remove_comments=1)).xpath("//*[name()='p' or name()='head']"))(sys.argv[1])))

Upvotes: 0

Nahuel Fouilleul

Reputation: 19315

To extract the text between <p> and </p>, try this:

perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file

perl -n0l012e 'print for m|<p>.*?</p>|gs'
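For readers more comfortable with Python, the second one-liner can be sketched roughly as follows. This is an approximate translation, not the answerer's code, and find_plain_paragraphs is a name invented for the example. The literal <p> in the pattern is what keeps <p class="en-cpy"> out of the results, and the non-greedy .*? stops at the first </p>. Like any regex approach, it assumes every <p> is explicitly closed.

```python
import re

# Non-greedy match from a literal "<p>" to the nearest "</p>".
# re.DOTALL lets "." cross newlines, like the /s flag in the Perl version.
P_BLOCK = re.compile(r"<p>.*?</p>", re.DOTALL)


def find_plain_paragraphs(text):
    """Return every attribute-free <p>...</p> block found in text."""
    return P_BLOCK.findall(text)


# The example lines from the question.
sample = (
    "<p>A paragraph comes here</p>\n"
    '<p>A paragraph comes here</p><span class="more-about">Some text here</span>'
    '<p class="en-cpy">Copyright &copy; 2012 </p>'
)

for block in find_plain_paragraphs(sample):
    print(block)
```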

Upvotes: 0
