user2702239

Reputation: 61

Regex to match all text on multiple lines unless it contains a specific string?

I know this question has been asked before, but none of the previous responses have worked for me. I have a PDF that I'm trying to convert in Calibre. In the conversion process, I want to get rid of the headers and footers, which look like these:

<hr/>
<a name=9></a>viii<br>
<i>Introduction</i><br>

<hr/>
<a name=10></a><i>Introduction</i><br>
ix<br>

I used the following regex, which worked beautifully to select all of these instances:

(?s)<hr/>(.*?)</a>(.*?)<br>(.*?)<br>

However, when there is a chapter title, the converted HTML looks like this:

<hr/>
<a name=8></a><a href="index.html#6">INTRODUCTION</a><br>

which is also picked up by my regex. I want to alter my code to ignore the chapter titles. I have tried dozens of combinations replacing the

(.*?) 

with things like

[^index] 
^((?!index).)*$ 
/(?s)^((?!index).)*$/ 

I have also tried each of these with href, =, and " in place of "index", but none of these patterns match anything. Any ideas what I need to change in my regex so I can remove the headers and footers without removing the chapter titles? Thank you in advance!

Upvotes: 1

Views: 133

Answers (1)

anon


It's not all that hard. Assuming that your HTML is always simple and doesn't contain anything tricky like < or > in quotes, just add this:

(?:<a[^>]+href=[^>]+>.*?</a>)?

immediately after the </a> in your current regex. That group says that the chapter-title link may or may not be there; the (?:...) syntax makes it a non-capturing group, so its text is not stored in a numbered group.
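One caveat worth checking before relying on this: a non-capturing group still consumes the text it matches, so if the whole match is replaced with nothing, the chapter link goes with it. Below is a minimal sketch in Python (as far as I know, Calibre's search and replace uses Python-flavored regular expressions) comparing the optional group with a negative lookahead that refuses to match when a link follows the named anchor. The sample text, the body line, and the names optional_group and header_only are invented for illustration, and the lookahead pattern assumes every header starts with <hr/> followed directly by an <a name=...></a> anchor.

```python
import re

# Invented sample: one chapter title followed by two running headers.
sample = (
    '<hr/>\n'
    '<a name=8></a><a href="index.html#6">INTRODUCTION</a><br>\n'
    'First line of the chapter.<br>\n'
    '<hr/>\n'
    '<a name=9></a>viii<br>\n'
    '<i>Introduction</i><br>\n'
    '<hr/>\n'
    '<a name=10></a><i>Introduction</i><br>\n'
    'ix<br>\n'
)

# The question's pattern with the optional non-capturing group added.
optional_group = re.compile(
    r'(?s)<hr/>(.*?)</a>(?:<a[^>]+href=[^>]+>.*?</a>)?(.*?)<br>(.*?)<br>'
)

# Hypothetical alternative: anchor tightly on <hr/> plus the named anchor,
# then use a negative lookahead to reject the match if a link comes next.
header_only = re.compile(
    r'(?s)<hr/>\s*<a name=[^>]*></a>(?!<a\s[^>]*href=)(.*?)<br>(.*?)<br>'
)

after_optional = optional_group.sub('', sample)
after_lookahead = header_only.sub('', sample)

print('INTRODUCTION' in after_optional)   # False: the title was removed too
print('INTRODUCTION' in after_lookahead)  # True: the title survives
print('viii' in after_lookahead)          # False: the header was removed
```

The tight prefix matters here: with the original lazy (.*?)</a>, the engine could stretch to the second </a> (the one closing the link) and satisfy the lookahead from there. For the same reason, [^index] from the question cannot help, since it is a character class that matches a single character other than i, n, d, e, or x rather than excluding the word.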

Upvotes: 1
