hEngi

Reputation: 915

PDFBox: bad characters in PDF-to-string conversion

I am using PDFBox 1.8.4 to convert a PDF to a string. For example, my PDF contains: Pólya, G. and G. Szegő. The output is: Po´lya, G. and G. Szego˝

Is there any way to solve this problem? (Yes, I know I can patch it with replaceAll("o'","ó").)

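   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.util.PDFTextStripper;  // package as of PDFBox 1.8.x
   PDDocument doc = PDDocument.load(path);
   PDFTextStripper strp = new PDFTextStripper("UTF-8");
   System.out.println(strp.getText(doc));
   doc.close();  // release the document once the text has been extracted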

All suggestions are welcome!

Edit 1: PDF_Example

Upvotes: 1

Views: 1153

Answers (2)

mkl

Reputation: 95918

The document presented by the OP contains e.g. this line

Attila Gobi, Zalan Sz}ugyi and Tamas Kozsik

which quite likely is a sample of the issue he has identified.

Looking at the page content stream, though,

[(A)32(ttila)-384(G\023)575(obi,)-383
(Zal)8(\023)567(an)-383(Sz)-32(})607(ugyi)-384(and)-383
(T)96(am)8(\023)567(as)-384(Kozsik)]TJ

one sees that, e.g., in (G\023)575(obi,) the ó is created by first drawing the ´ (\023), then moving back by the width of that glyph (575), and then drawing the o.

Thus, you do have these two glyphs ´ and o printed in the same location, not a single glyph ó.
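To make the arithmetic concrete, here is a minimal sketch that replays the fragment (G\023)575(obi,) up to the o. It relies on the rule that a number inside a TJ array is subtracted from the current horizontal position, in thousandths of a text-space unit. The class and method names are hypothetical, and so are the glyph widths for G and o; only the 575 for the accent follows the reading above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TjOverlapDemo {
    // Assumed glyph widths in thousandths of a text-space unit.
    // 575 for the acute accent follows the answer; 785 and 500 are made up.
    static int width(char glyph) {
        switch (glyph) {
            case 'G':      return 785;
            case '\u00B4': return 575; // spacing acute accent, \023 in the stream
            default:       return 500;
        }
    }

    /** Returns the horizontal start position of every glyph, in drawing order. */
    static int[] glyphStarts(Object... tjArray) {
        List<Integer> starts = new ArrayList<>();
        int x = 0;
        for (Object element : tjArray) {
            if (element instanceof Integer) {
                x -= (Integer) element;      // TJ numbers move the position backwards
            } else {
                for (char glyph : ((String) element).toCharArray()) {
                    starts.add(x);           // the glyph is drawn at the current position
                    x += width(glyph);       // then the position advances by its width
                }
            }
        }
        return starts.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Simplified form of (G\023)575(o...)
        int[] starts = glyphStarts("G\u00B4", 575, "o");
        System.out.println(Arrays.toString(starts)); // prints [0, 785, 785]
    }
}
```

Under these assumed widths the accent and the o both start at 785, which is why a text extractor sees two separate glyphs at one position.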

PDFBox's PDFTextStripper currently does not combine characters printed at the same location; the only thing it does is drop an identical glyph drawn twice at about the same location.

Thus aside from replaceAll("o'","ó") as mentioned by the OP, one can also extend the PDFTextStripper to combine certain glyphs, either early in its method processTextPosition or late in writeString(String text, List<TextPosition> textPositions).
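As a more general variant of the replaceAll idea, the extracted string can be post-processed with plain Java: turn a spacing accent into its combining counterpart and let Unicode normalization compose the pair. This is only a sketch. The class and method names are hypothetical, it assumes the accent directly follows its base letter as in the OP's output (Po´lya), and it covers just three accents. It would also rewrite a legitimate standalone accent that happens to follow a letter.

```java
import java.text.Normalizer;

class AccentFixer {
    /** Maps a spacing accent to its combining form, or returns 0 if unknown. */
    static char toCombining(char spacing) {
        switch (spacing) {
            case '\u00B4': return '\u0301'; // acute accent
            case '\u02DD': return '\u030B'; // double acute accent
            case '\u00A8': return '\u0308'; // diaeresis
            default:       return 0;
        }
    }

    /** Replaces a spacing accent that follows a letter, then composes with NFC. */
    static String combineSpacingAccents(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char combining = toCombining(c);
            boolean afterLetter = i > 0 && Character.isLetter(text.charAt(i - 1));
            sb.append(combining != 0 && afterLetter ? combining : c);
        }
        // NFC merges base letter + combining mark into one code point where one exists.
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String extracted = "Po\u00B4lya, G. and G. Szego\u02DD";
        System.out.println(combineSpacingAccents(extracted));
    }
}
```

The same function could be applied to the result of strp.getText(doc). If the extractor emits the accent before the base letter instead, the loop would need to look ahead rather than back.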

Upvotes: 3

Nader Ghanbari

Reputation: 4300

Maybe the problem is with the PDF file encoding (i.e. the encoding is not UTF-8).

As a hint, look at this question in the PDFBox documentation FAQ.

Upvotes: 1
