Reputation: 61
I know this question has been asked before, but none of the previous responses have worked for me. I have a PDF that I'm trying to convert in Calibre. In the conversion process, I want to get rid of the headers and footers, which look like these:
<hr/>
<a name=9></a>viii<br>
<i>Introduction</i><br>
<hr/>
<a name=10></a><i>Introduction</i><br>
ix<br>
I used the following regex, which worked beautifully to select all of these instances:
(?s)<hr/>(.*?)</a>(.*?)<br>(.*?)<br>
HOWEVER, when there is a chapter title, the PDF code says this:
<hr/>
<a name=8></a><a href="index.html#6">INTRODUCTION</a><br>
which is also picked up by my regex. I want to alter my code to ignore the chapter titles. I have tried dozens of combinations replacing the (.*?) with things like
[^index]
^((?!index).)*$
/(?s)^((?!index).)*$/
I have also tried each of these with href, =, and " instead of "index," but none of these codes pick up anything. Any ideas what I need to change in my code so I can remove the headers and footers without removing the chapter titles? Thank you in advance!
Upvotes: 1
Views: 133
Reputation:
It's not all that hard. Assuming that your HTML is always simple and doesn't contain anything tricky like < or > in quotes, just add this:
(?:<a[^>]+href=[^>]+>.*?</a>)?
immediately after the </a> in your current regex. That bit says that the linked header may or may not be there, and either way it isn't captured (that's what the non-capturing group, (?:...), does).
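One caveat worth checking: an optional group, capturing or not, is still part of the overall match, so a block containing the linked chapter title would still be matched and removed. If the goal is to leave those blocks alone, a negative lookahead placed right after the </a> is one alternative. The sketch below is only an illustration under some assumptions: that Calibre's search-and-replace accepts Python-style regex syntax, that every header starts with an `<a name=N></a>` anchor as in the question's snippets, and that chapter titles always put a second `<a ...>` tag directly after it. The sample body-text lines and the names `SAMPLE`, `SKIP_TITLES`, and `strip_headers` are made up for the example.

```python
import re

# Markup modeled on the snippets in the question; the body-text lines are invented.
SAMPLE = (
    "<hr/>\n"
    "<a name=9></a>viii<br>\n"
    "<i>Introduction</i><br>\n"
    "Body text A.<br>\n"
    "<hr/>\n"
    "<a name=10></a><i>Introduction</i><br>\n"
    "ix<br>\n"
    "Body text B.<br>\n"
    "<hr/>\n"
    '<a name=8></a><a href="index.html#6">INTRODUCTION</a><br>\n'
    "First line of the chapter.<br>\n"
)

# The pattern from the question: it also matches the chapter-title block.
ORIGINAL = r'(?s)<hr/>(.*?)</a>(.*?)<br>(.*?)<br>'

# Assumed variant: match the anchor tag strictly, then refuse to continue
# if another <a ...> tag follows immediately (the chapter-title case).
SKIP_TITLES = r'(?s)<hr/>\s*<a name=\d+></a>(?!<a\s)(.*?)<br>(.*?)<br>'


def strip_headers(html, pattern=SKIP_TITLES):
    """Delete every header/footer block that the pattern matches."""
    return re.sub(pattern, "", html)


cleaned = strip_headers(SAMPLE)
print(cleaned)
```

The anchor tag is spelled out as `<a name=\d+></a>` instead of `(.*?)</a>` on purpose: with a lazy `(.*?)` in front, the engine could backtrack past the first </a> and land on the </a> that closes the chapter link, which would sidestep the lookahead. In Calibre you would paste only the pattern itself into the search box and leave the replacement empty.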
Upvotes: 1