CODEWITHSUNDEEP

character-encoding pdfbox


Wrong character encoding when extracting text from a PDF with PDFBox

Recently I had to index PDFs into Elasticsearch, and I am using PDFBox to extract the text from them. However, the extracted text comes out with the wrong characters, like this:

Ýëĭ2ĈjŬj§ė¥ 
1 ŋ?nĳ"2$ 2016£ 2Ú 5Õ,”Òªj§?ně#ĳ"2ě
^ë2ļŘœ A$j§?n 2016£ě#ëÖĭ2Ĉļê    
2 èÅŋ?n$ 2016£ 2Ú 6ÕöĿS¿    ĿS¿ ĿS
Õ¿  ĿSÖ¿  eöĿS&ØºĨĘ
http://www.sse.com.cnLćĈ
A$j§Ýëĭ2ĈŘĐ

My code is exactly the same as the example on this page. I tried PDFBox versions from 0.8.x to 2.0.x, but none of them worked.

Any help or advice would be appreciated!

Upvotes: 0

Views: 1179

Answers (1)


I got the answer from @Tilman's comment.

See pdfbox.apache.org/1.8/faq.html#notext and the answer below too.
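As I understand that FAQ entry, some PDFs embed fonts without usable Unicode mapping information, so the extracted text is garbage no matter which PDFBox version is used. Since the goal here was indexing, it may help to flag such documents before they reach the index. The following is only a rough sketch of one way to do that, using nothing but the JDK. The class name `ExtractionSanityCheck`, its methods, and the 0.5 threshold are all made up for illustration, and it assumes the source PDF is Chinese, which the sse.com.cn URL suggests but which is not stated in the question.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of PDFBox: a heuristic check on text that
// has already been extracted from a PDF.
class ExtractionSanityCheck {

    // Fraction of the letters in `text` that belong to the expected script.
    // Digits, punctuation and whitespace are ignored. Returns 0.0 when the
    // text contains no letters at all.
    public static double scriptRatio(String text, UnicodeScript expected) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (UnicodeScript.of(cp) == expected) {
                matching++;
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    // True when fewer than `minRatio` of the letters are in the expected
    // script. The threshold is an assumption and needs tuning on real data.
    public static boolean looksGarbled(String text, UnicodeScript expected, double minRatio) {
        return scriptRatio(text, expected) < minRatio;
    }

    public static void main(String[] args) {
        // Start of the garbled output from the question, written with
        // Unicode escapes (Latin letters with diacritics).
        String garbled = "\u00DD\u00EB\u012D2\u0108j";
        // Assumed example of correctly extracted text: the Chinese name of
        // the Shanghai Stock Exchange, followed by a year.
        String clean = "\u4E0A\u6D77\u8BC1\u5238\u4EA4\u6613\u6240 2016";

        System.out.println(looksGarbled(garbled, UnicodeScript.HAN, 0.5));
        System.out.println(looksGarbled(clean, UnicodeScript.HAN, 0.5));
    }
}
```

Documents flagged this way could be skipped or routed to OCR instead of being indexed as-is. The check is only a heuristic: a document that legitimately mixes scripts, or one that contains almost no letters, can be misclassified.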

Upvotes: 1
