Edit text style in a PDF document

Question

I'm working on a C# console application designed for editing the text style in existing PDF files, for instance making text bold or italic, changing the font family, or changing the text color.

I used the iTextSharp library but encountered the following issues:

1. Thin spaces in the PDF document are trimmed.
2. When extracting text from an existing document, the text style is totally ignored (I mean fonts, bold, italic, etc.).
3. Maths, images and text in special formats are not read when extracting content from the PDF file.

Is there another library, or any suggestion for editing a PDF file as described above?

mkl · Accepted Answer

Some words on the issues you encountered...

1. Thin spaces in PDF document are trimmed.

Thin spaces are generally produced by a horizontal coordinate shift rather than by an actual space character. Unfortunately, the same technique is used for kerning, i.e. to make adjacent characters look better. When a parser encounters such a horizontal shift on a page, it has to decide heuristically whether the shift represents a space, and it is sometimes wrong. These heuristics seem to fail for your document.
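To illustrate why such a decision can go wrong, here is a minimal sketch of a gap heuristic. It is not iText(Sharp)'s actual algorithm; the `TextRun` and `GapHeuristic` types, the `spaceWidth` parameter and the `threshold` factor are all hypothetical names made up for this example.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical model of one run of text as a parser might see it:
// the string plus the horizontal positions where it starts and ends.
public sealed class TextRun
{
    public string Text { get; }
    public double StartX { get; }
    public double EndX { get; }

    public TextRun(string text, double startX, double endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapHeuristic
{
    // Joins runs that sit on one line. A space is inserted when the gap
    // between two runs is at least spaceWidth * threshold; any smaller
    // gap is treated as kerning and ignored.
    public static string Join(IEnumerable<TextRun> runs, double spaceWidth, double threshold)
    {
        var result = new StringBuilder();
        double? previousEnd = null;
        foreach (TextRun run in runs)
        {
            if (previousEnd.HasValue && run.StartX - previousEnd.Value >= spaceWidth * threshold)
            {
                result.Append(' ');
            }
            result.Append(run.Text);
            previousEnd = run.EndX;
        }
        return result.ToString();
    }
}
```

With a gap of 1.0 unit and an assumed space width of 2.5 units, a threshold of 0.5 drops the thin space while a threshold of 0.3 keeps it. Lowering the threshold, however, makes it more likely that wide kerning is misread as a space, which is why no single value works for every document.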

2. When extracting a text from an existing document, the text style is totally ignored (I mean fonts, bold, italic, etc.)

That is a matter of the RenderListener you use. The listeners bundled with iText(Sharp) currently focus on the text. They can easily be extended to also transport font information.
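As a rough sketch of such an extension, assuming iTextSharp 5.x (where the interface is named `IRenderListener`), a listener can record the font name next to each text chunk. The class names `FontAwareListener` and `FontDump` are made up for this example; they are not part of the library.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: stores each text chunk with the PostScript
// name of the font it is drawn in (key = font name, value = text).
public class FontAwareListener : IRenderListener
{
    public List<KeyValuePair<string, string>> Chunks { get; } =
        new List<KeyValuePair<string, string>>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        Chunks.Add(new KeyValuePair<string, string>(
            font.PostscriptFontName, renderInfo.GetText()));
    }
}

public static class FontDump
{
    // Runs the listener over one page (page numbers start at 1).
    public static List<KeyValuePair<string, string>> Read(string path, int pageNumber)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            FontAwareListener listener =
                parser.ProcessContent(pageNumber, new FontAwareListener());
            return listener.Chunks;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `FontDump.Read("input.pdf", 1)` (the file name is a placeholder) would return the chunks of the first page in content stream order, which is not necessarily reading order.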

You should be aware, though, that PDF has no notion of bold, italic, etc. as text attributes. In good-quality documents, xxx and xxx bold are separate fonts. In lower-quality documents, a poor man's bold may be produced by printing the glyphs twice with a minute offset, and a slanted appearance may be produced by means of an appropriate skewing transformation matrix.
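For the good-quality case, where the style is only visible in the font name, one could guess it with a naming heuristic like the following sketch. The `FontNameStyle` class is hypothetical, the keywords are merely common naming conventions, and this approach cannot detect a poor man's bold or a skewed matrix.

```csharp
using System;

public static class FontNameStyle
{
    // Embedded subset fonts carry a tag of six uppercase letters and a
    // plus sign in front of the real name, e.g. "ABCDEF+Helvetica-Bold".
    public static string StripSubsetPrefix(string fontName)
    {
        if (fontName.Length > 7 && fontName[6] == '+')
        {
            bool allUpper = true;
            for (int i = 0; i < 6; i++)
            {
                if (fontName[i] < 'A' || fontName[i] > 'Z')
                {
                    allUpper = false;
                }
            }
            if (allUpper)
            {
                return fontName.Substring(7);
            }
        }
        return fontName;
    }

    // Guess only: font names are free-form, so this can be wrong.
    public static bool LooksBold(string fontName)
    {
        return Contains(StripSubsetPrefix(fontName), "Bold");
    }

    // Guess only: treats both "Italic" and "Oblique" as italic.
    public static bool LooksItalic(string fontName)
    {
        string name = StripSubsetPrefix(fontName);
        return Contains(name, "Italic") || Contains(name, "Oblique");
    }

    private static bool Contains(string name, string keyword)
    {
        return name.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

The subset tag is stripped first because it consists of arbitrary letters and could by chance contain one of the keywords.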

3. Maths, images and texts of special formats are not read when extracting content from PDF file

If you have samples for this, please supply them here or on the itext-questions mailing list. Just to be sure: did you implement a RenderListener which listens to image events when testing?
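A listener that reacts to image events might look like the sketch below, again assuming iTextSharp 5.x. `ImageCollectingListener` is a made-up name, and the "unknown" and "undecodable" labels are placeholders chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: ignores text and records one entry per image event.
public class ImageCollectingListener : IRenderListener
{
    public List<string> ImageTypes { get; } = new List<string>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        string type;
        try
        {
            // Decoding may fail for image formats the library does not support.
            PdfImageObject image = renderInfo.GetImage();
            type = image != null ? image.GetFileType() : "unknown";
        }
        catch (Exception)
        {
            type = "undecodable";
        }
        ImageTypes.Add(type);
    }
}
```

It can be passed to `PdfReaderContentParser.ProcessContent` in the same way as a text listener. If `ImageTypes` stays empty for a page that visibly contains pictures, that would be worth mentioning together with the sample.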

Thus, 1 is a general problem for which there may be better algorithms but which cannot be solved in a 100% reliable way. 2 merely requires you to implement an appropriate RenderListener based on one of the existing text-only ones; actually, there has been quite some talk about creating a RichTextExtractionStrategy for iText. 3 has to be inspected more closely, though.

In essence, iText(Sharp) is not the only PDF library with text parsing capabilities, and each of them surely has its own advantages. It does, though, supply a framework which can be used to retrieve as much information about the document's text style as possible.

I'm working on a C# Console application that's designed for editing the text style in existing PDF files, for instance change the text style to be in bold or italic or add font-family, change text color, etc.

That is quite a feat, considering that different fonts or different styles in the same font family may have significantly different widths. This may result in an ugly appearance or in the need to reflow text, which is something PDF is not really good at.
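The width difference is easy to check with iTextSharp 5.x itself. The following sketch measures one string in a given built-in font; `WidthComparison` is a made-up helper name.

```csharp
using iTextSharp.text.pdf;

public static class WidthComparison
{
    // Returns the width of the text in points at the given font size,
    // using one of the standard built-in fonts (not embedded).
    public static float Measure(string fontName, string text, float fontSize)
    {
        BaseFont font = BaseFont.CreateFont(fontName, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
        return font.GetWidthPoint(text, fontSize);
    }
}
```

Comparing `Measure(BaseFont.HELVETICA, "Edit text style", 12f)` with `Measure(BaseFont.HELVETICA_BOLD, "Edit text style", 12f)` shows that the bold variant needs more horizontal space, so text that is switched to bold in place can run into whatever follows it on the line.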
