Reputation: 9
I need to reformat the text below. To do so, I need to match only the text between double quotes that sits inside p tags (`<p>` and `</p>`).
This text below is an example:
<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
<p>No matter how many times he pressed the trigger, the rounds didn't budge.
The approaching figure mockingly spoke, "Haha, what a scene! The Great Detective Song Lang,
actually killing his superior and partner with his very own hands! I can't wait to see the
headlines in the newspapers tomorrow!"
</p>
I need only to match "I'll kill you!"
and "Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!"
But most of the regexes I tried matched either all the text between quotes (`"(.*?)"`), all the text between the p tags (`<p>(.|\n)*?<\/p>`), or something in between.
I use Calibre's search and replace, so I am limited to a single-line regex. I use RegExr to test the expressions.
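To make the over-matching concrete, here is a minimal sketch that runs the two patterns above against an excerpt of the sample. It uses Python's `re` module purely as a stand-in for testing (an assumption; Calibre's search and replace may differ in details), and the variable names are only illustrative.

```python
import re

# Excerpt of the sample from the question.
sample = """<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
"""

# Quote-only pattern: it also captures the attribute values of the <div> tag.
quote_matches = re.findall(r'"(.*?)"', sample)

# Paragraph-only pattern: it returns whole <p> elements, not just the quotes.
paragraph_matches = [m.group(0) for m in re.finditer(r"<p>(.|\n)*?</p>", sample)]

print(quote_matches)
print(len(paragraph_matches))
```

The first pattern picks up `vung_doc` twice before reaching the dialogue, and the second one returns complete paragraphs, which is exactly the "too much" behaviour described above.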
Upvotes: 0
Views: 133
Reputation: 184955
Don't use regex to parse HTML: you cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like `xidel`, `xmlstarlet` or `xmllint` if you need a quick shot from a command-line shell.
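With `xidel` (inside an XPath string literal delimited by double quotes, a literal double quote is written as `""`, hence the triple quotes):

    xidel -e '//p/extract(text(),"""(.+)""",1,"s")[.]' file
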
Credits to Reino.
With `xidel` and `grep` (note that `grep` reads from the pipe, so it takes no file argument):

    xidel -e '//p' file | grep -oP '"\K[^"]+'

I'll kill you!
Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!
Here, I use the `grep` regex only on the text part.
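For the built-in parser support mentioned above, here is a rough sketch using Python's standard `html.parser` module. The class and function names (`ParagraphTextParser`, `quotes_in_paragraphs`) are hypothetical, and it assumes straight double quotes and non-nested `<p>` elements, as in the sample. The parser isolates the paragraph text first, and the regex is applied only to that text.

```python
import re
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect the text content of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while inside a <p> element
        self.buffer = []        # text chunks of the current paragraph
        self.paragraphs = []    # text of every finished paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.paragraphs.append("".join(self.buffer))
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


def quotes_in_paragraphs(html):
    """Return the double-quoted passages found inside <p> elements."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    quotes = []
    for text in parser.paragraphs:
        for quote in re.findall(r'"([^"]+)"', text):
            # Collapse line breaks inside a quote into single spaces.
            quotes.append(" ".join(quote.split()))
    return quotes


sample = """<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
<p>No matter how many times he pressed the trigger, the rounds didn't budge.
The approaching figure mockingly spoke, "Haha, what a scene! The Great Detective Song Lang,
actually killing his superior and partner with his very own hands! I can't wait to see the
headlines in the newspapers tomorrow!"
</p>
"""

quotes = quotes_in_paragraphs(sample)
for quote in quotes:
    print(quote)
```

Because the attribute values of the `<div>` tag never reach `handle_data`, they cannot be mistaken for dialogue, which is the main reason for parsing first and matching second.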
Upvotes: 1