Wagner Franklin

Reputation: 9

How to match text in quotation marks between p tags (regex) - Calibre Search and Replace

I need to do some formatting on the text below and to do so I need to match only the text between quotes inside p tags (<p> and </p>).

This text below is an example:

<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
<p>No matter how many times he pressed the trigger, the rounds didn't budge.
The approaching figure mockingly spoke, "Haha, what a scene! The Great Detective Song Lang, 
actually killing his superior and partner with his very own hands! I can't wait to see the 
headlines in the newspapers tomorrow!"
</p>

I only need to match "I'll kill you!" and "Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!"

But most of the regexes I tried matched either all the text between quotes, with "(.*?)", or all the text between the p tags, with <p>(.|\n)*?<\/p>, or something in between.

I use Calibre's search and replace, so the solution has to be a single regex. I use RegExr to test the expressions.
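To make the problem concrete, here is a minimal sketch of how the two attempts above behave. It assumes Python's `re` module is close enough to Calibre's regex engine for these patterns, uses a shortened version of the sample text, and wraps the second pattern in a capturing group so that `findall` returns the whole paragraph. The variable names are only for illustration.

```python
import re

# Shortened sample, keeping the <div> attributes and one multi-line quote.
sample = (
    '<div class="vung_doc" id="vung_doc">\n'
    '<p>He picked up the pistol from the pool of blood and pointed it at the\n'
    'person coming towards him, screaming, "I\'ll kill you!"\n'
    '</p>\n'
    '<p>The approaching figure mockingly spoke, "Haha, what a scene!\n'
    'I can\'t wait to see the headlines in the newspapers tomorrow!"\n'
    '</p>\n'
)

# Attempt 1: anything between quotes. It ignores the <p> tags, so it also
# picks up attribute values, and "." does not cross line breaks by default,
# so the multi-line quote is missed.
quotes_only = re.findall(r'"(.*?)"', sample)

# Attempt 2: everything between <p> and </p>, narration included.
paragraphs = re.findall(r'<p>((?:.|\n)*?)</p>', sample)

print(quotes_only)
print(len(paragraphs))
```

The first pattern returns the attribute values along with the one-line quote, and the second returns whole paragraphs, which is neither of the results I am after.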

Upvotes: 0

Views: 133

Answers (1)

Gilles Quénot

Reputation: 184955

Don't use regex to parse HTML

You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet or xmllint if you need a quick solution from a command-line shell.

With xidel:

xidel -e '//p/extract(text(),"&quot;(.+)&quot;",1,"s")[.]' file

Credits to Reino.

With xidel and grep:

xidel -e '//p' file | grep -oP '"\K[^"]+'

Output

I'll kill you!
Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!

Here, the grep regex is applied only to the text that xidel extracted from the p elements, not to the HTML markup.
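The same two-step idea (let a parser find the p elements, then run a regex only on their text) can be sketched in Python with the standard library's `html.parser`. This is a minimal sketch, not a drop-in for Calibre's search and replace box: the names `ParagraphText` and `quoted_in_paragraphs` are made up for this example, and it assumes straight double quotes and no nested p elements.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per finished <p>
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._chunks))

    def handle_data(self, data):
        # Attribute values never arrive here, only text nodes.
        if self._in_p:
            self._chunks.append(data)


def quoted_in_paragraphs(html):
    """Return the text between straight double quotes inside <p> elements."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    quotes = []
    for text in parser.paragraphs:
        for quote in re.findall(r'"([^"]+)"', text):
            # Collapse line breaks and repeated spaces inside a quote.
            quotes.append(" ".join(quote.split()))
    return quotes


sample = '''<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
<p>No matter how many times he pressed the trigger, the rounds didn't budge.
The approaching figure mockingly spoke, "Haha, what a scene! The Great Detective Song Lang, 
actually killing his superior and partner with his very own hands! I can't wait to see the 
headlines in the newspapers tomorrow!"
</p>
</div>'''

for quote in quoted_in_paragraphs(sample):
    print(quote)
```

Because the regex only ever sees the text nodes, quoted attribute values such as class="vung_doc" cannot match, and a quote that spans several lines is returned as one string.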

Upvotes: 1
