cursez

Reputation: 7

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to MOBI format with Calibre.

The problem I am facing is that the code blocks inside the converted version completely lose indentation.

Using a regular expression, I managed to correctly indent the lines that were indented with white space. I did so by transforming every two white spaces into two non-breaking spaces.
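A minimal sketch of that kind of substitution, in Python. This is an illustration only, not the exact expression used in Calibre: the function name `preserve_space_indent` is hypothetical, and it assumes that only leading indentation needs to be preserved.

```python
import re

NBSP = "\u00a0"  # non-breaking space

def preserve_space_indent(line: str) -> str:
    """Replace each leading pair of spaces with two non-breaking spaces.

    Hypothetical helper: it only touches indentation at the start of a line.
    """
    match = re.match(r"^(?:  )+", line)
    if not match:
        return line  # no space-pair indentation (for example a tab-indented line)
    indent = match.group(0)
    return NBSP * len(indent) + line[len(indent):]
```

A line that starts with a tab is returned unchanged, which is exactly the case this approach does not cover.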

Unfortunately, some of the code blocks are indented using the tab character, so the regular expression does not work in those cases.

I came to realize that the conversion from PDF to MOBI has an intermediate step in which the PDF is converted to HTML. That is where the tab information is lost, because no special tag is generated to carry it.

So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces (\s\s). That way the regular expression I mentioned above will work for all the code blocks, and the code will be indented properly.

But I have no idea which software can substitute elements inside a PDF like this.

Upvotes: 0

Views: 2906

Answers (1)

KenS

Reputation: 31207

I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x09 in ASCII) has no special significance in PDF text strings; in particular, it does not move the current point, it simply draws a glyph. As a result, if you do (A\tB), what you will see when the PDF is rendered is 'AB', or 'A*B' where the * is some other character you didn't expect (often a square).

So you would probably have to convert current point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.

So you will need to do it manually.

The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you: MuPDF is one, and I think pdftk is another.
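To give an idea of what 'decompressing' means here, the following is a rough Python sketch that inflates Flate-compressed streams using only the standard library. It is not how MuPDF or pdftk work internally; the function name `inflate_streams` is hypothetical, and it assumes plain `/FlateDecode` streams with no predictors, no encryption and no other filters, so a real tool is the better choice for an actual book.

```python
import re
import zlib

def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the inflated payload of every stream that zlib can decompress.

    Hypothetical helper: it scans for 'stream ... endstream' by regex, which
    is only safe for simple files.
    """
    results = []
    pattern = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
    for match in pattern.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            # The end-of-line before 'endstream' lands in unused_data.
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed, or uses a filter not handled here
    return results
```

The inflated bytes are the page operators (text showing and positioning), which is where the manual editing described next would happen.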

Then you will need to locate the position where you want to insert space. This could be challenging, as the font may be re-encoded to something other than ASCII, so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded and subset. This means that a space may not be 0x20, and indeed the font may not even contain a space glyph. And of course you need to remove the operations that reposition the current point.

Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the byte offset of every object in the file. If you've edited the file, it's likely that you will have altered the length of one or more objects, which will change the offset of any following objects, so you will need to recalculate those and update the xref table.
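As a rough illustration of that last step, here is a Python sketch that recomputes offsets and emits a classic xref table. The function name `build_xref` is hypothetical, and it makes strong simplifying assumptions: every object starts at the beginning of a line as `N G obj`, objects are numbered contiguously from 1, no stream data happens to contain such a marker, and the file does not use cross-reference streams or object streams.

```python
import re

def build_xref(pdf_bytes: bytes) -> bytes:
    """Build a classic xref table from the byte offsets of 'N G obj' markers.

    Hypothetical helper for simple, uncompressed files only.
    """
    offsets = {}
    for match in re.finditer(rb"^(\d+) (\d+) obj\b", pdf_bytes, re.MULTILINE):
        offsets[int(match.group(1))] = (match.start(), int(match.group(2)))
    if not offsets or sorted(offsets) != list(range(1, len(offsets) + 1)):
        raise ValueError("expected objects numbered contiguously from 1")
    size = len(offsets) + 1
    # Object 0 is always the head of the free list.
    lines = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for number in range(1, size):
        offset, generation = offsets[number]
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        # the keyword 'n', and a two-byte end-of-line (space + LF here).
        lines.append(b"%010d %05d n \n" % (offset, generation))
    return b"".join(lines)
```

Even with a rebuilt table, the `startxref` value in the trailer and the `/Length` entry of any edited stream would also have to be brought up to date, which this sketch does not attempt.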

I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.

Upvotes: 2
