Reputation: 1811
I am trying to modify the contents of a PDF document using PDFBox. I used this example as it is, but observed that the text in my PDF file is getting split at character level (or worse). For example, the string "EM? what it is:"
gets split into:
COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}
(as seen when printing each cosString in the above-mentioned code). As far as I can see, there are only Latin characters in the file, and the encoding is ISO-8859-1. Any ideas?
Regards,
Salil
Upvotes: 1
Views: 1404
Reputation: 2746
This is most likely down to how the PDF itself is laid out rather than anything PDFBox is doing wrong. Your particular PDF stores the text as separate fragments so that it can adjust letter spacing or kerning between them. How the text is split varies greatly from PDF to PDF, depending on how the file was created.
Typically, I would suggest simply merging all the different tokens into one big content string and working with that.
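A minimal sketch of that merging step, in plain Java with no PDFBox dependency. It assumes the fragments arrive as a list in which strings are text pieces and numbers are positioning adjustments; in PDFBox these would presumably be COSString and numeric entries in the array you are iterating over. The class name TjMerge, the method mergeFragments and the adjustment values -20 and 15 are made up for illustration.

```java
import java.util.List;

class TjMerge {
    /**
     * Concatenates the text fragments of a split string in order.
     * Non-string elements (assumed to be spacing/kerning adjustments) are skipped.
     */
    static String mergeFragments(List<Object> fragments) {
        StringBuilder merged = new StringBuilder();
        for (Object element : fragments) {
            if (element instanceof String) {
                merged.append((String) element);
            }
            // Numeric adjustments are dropped, so the original kerning is lost.
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Fragments from the question, with two illustrative adjustment values mixed in.
        List<Object> fragments =
                List.of("E", -20, "M?", " ", "w", 15, "hat ", "it ", "is", ":");
        System.out.println(mergeFragments(fragments)); // prints "EM? what it is:"
    }
}
```

Note that writing the merged string back as a single token discards the per-fragment spacing, so the rendered text may shift slightly compared to the original.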
Upvotes: 1