Salil

Reputation: 1811

Parsing PDF file using Apache PDFBox

I am trying to modify the contents of a PDF document using PDFBox. I used this example as is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(This is what I see when printing each COSString in the above-mentioned code.) As far as I can see, the file contains only Latin characters, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

Upvotes: 1

Views: 1404

Answers (1)

Joel Westberg

Reputation: 2746

This is most likely down to how the PDF itself is structured rather than a problem with PDFBox. Your particular PDF stores the text in separate fragments so that it can adjust the letter spacing or kerning between them. How the text is split varies greatly from PDF to PDF, depending on the tool that created it.

In most cases, I would suggest simply merging the separate tokens into one content string before working with the text.
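As a rough illustration of that merging step, here is a minimal sketch in plain Java. It does not call PDFBox: the operand list is modelled as ordinary String and Number objects, standing in for the COSString and numeric entries you would find in the array, and the class and method names (TokenMerger, merge) are hypothetical. It assumes the numeric entries are only spacing adjustments and can be dropped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of PDFBox.
class TokenMerger {

    // Concatenates the string fragments of one text-showing operand list.
    // Numeric entries (assumed to be spacing/kerning adjustments) are skipped.
    static String merge(List<?> operands) {
        StringBuilder merged = new StringBuilder();
        for (Object operand : operands) {
            if (operand instanceof String) {
                merged.append((String) operand);
            }
            // Anything else (e.g. a Number) is ignored in this sketch.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // The fragments from the question, in the order they were printed.
        List<?> fragments = Arrays.asList("E", "M?", " ", "w", "hat ", "it ", "is", ":");
        System.out.println(merge(fragments)); // prints "EM? what it is:"
    }
}
```

With real PDFBox objects you would read each fragment through COSString.getString() instead of casting to String. Note that writing a single merged string back discards the spacing adjustments, so the rendered text may look slightly different from the original.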

Upvotes: 1
