Reputation: 11700
I have some html files and want to extract the contents between some tags: The title of the page some tagged content here.
<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright © 2012 </p>
I just want these tags: head, p
but as could be seen in the second paragraph, the last tag is which starts with p but is not my desires tag, and I don't want its content.
I used following script for extracting my desired text, but I can't filter out the tags such as the last one in my example.... How is it possible to extract just <p>
tags?
grep "<p>" $File | sed -e 's/^[ \t]*//'
I have to add that, the last tag (which I don't want to appear in the output) is right after one of my desired tags (as is in my example) and using grep command all the content of that line would be returned as output... (This is my problem)
Upvotes: 0
Views: 405
Reputation: 10662
Don't. Trying to use regex
to parse HTML is going to be painful. Use something like Ruby
and Nokogiri
, or a similar language + library that you are familiar with.
Upvotes: 3
Reputation: 6577
xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"
If you're dealing with broken HTML you might need a different parser. Here's a "one-liner" basically the same using lxml
. Just pass the script your URL
#!/usr/bin/env python3
from lxml import etree
import sys
print('\n'.join(etree.tostring(x, encoding="utf-8", with_tail=False).decode("utf-8") for x in (lambda i: etree.parse(i, etree.HTMLParser(remove_blank_text=1, remove_comments=1)).xpath("//*[name()='p' or name()='head']"))(sys.argv[0])))
Upvotes: 0
Reputation: 19315
to extract text between <p> and </p>, try this
perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file
or
perl -n0l012e 'print for m|<p>.*?</p>|gs'
Upvotes: 0