Reputation: 43
Recently I had to index PDF files into Elasticsearch, and I am using PDFBox to extract the text from them. However, the extracted text comes out with wrongly encoded characters, like this:
Ýëĭ2ĈjŬj§ė¥
1 ŋ?nij"2$ 2016£ 2Ú 5Õ,”Òªj§?ně#ij"2ě
^ë2ļŘœ A$j§?n 2016£ě#ëÖĭ2Ĉļê
2 èÅŋ?n$ 2016£ 2Ú 6ÕöĿS¿ ĿS¿ ĿS
Õ¿ ĿSÖ¿ eöĿS&غĨĘ
http://www.sse.com.cnLćĈ
A$j§Ýëĭ2ĈŘĐ
My code is exactly the same as the example on this page. I have tried PDFBox versions from 0.8.x to 2.0.x, but none of them work.
Any help or advice would be greatly appreciated!
Upvotes: 0
Views: 1170
Reputation: 43
I got the answer from @Tilman's comment.
See pdfbox.apache.org/1.8/faq.html#notext, and the other answer below too.
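Since output like the sample in the question would otherwise end up in the Elasticsearch index, it may help to screen the extracted text before indexing it. The following is only a rough sketch and not part of PDFBox: `GarbledTextCheck`, `accentedLatinRatio`, `looksGarbled`, and the 0.5 threshold are all hypothetical names and values I chose for illustration. It assumes the documents are not expected to consist mostly of accented Latin letters (mine are Chinese), so a high share of letters in the range U+00C0 to U+017F hints at a broken character mapping.

```java
public class GarbledTextCheck {

    // Hypothetical helper: share of letters that fall in the
    // Latin-1 Supplement / Latin Extended-A range (U+00C0..U+017F).
    public static double accentedLatinRatio(String text) {
        int letters = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // digits, punctuation and symbols are ignored
            }
            letters++;
            if (cp >= 0x00C0 && cp <= 0x017F) {
                suspicious++;
            }
        }
        // Text without any letters is treated as not suspicious.
        return letters == 0 ? 0.0 : (double) suspicious / letters;
    }

    // Assumption: the threshold is picked by the caller; 0.5 is only a guess.
    public static boolean looksGarbled(String text, double threshold) {
        return accentedLatinRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // First line of the wrong output from the question, written with
        // Unicode escapes so the source file encoding does not matter.
        String sample = "\u00DD\u00EB\u012D2\u0108j\u016Cj\u00A7\u0117\u00A5";
        System.out.println(looksGarbled(sample, 0.5));
        System.out.println(looksGarbled("Plain English text", 0.5));
    }
}
```

This does not repair anything; it only flags files that probably need another route (for example OCR) instead of being indexed as they are. Legitimate text in languages that rely heavily on accented Latin letters could be flagged as well, so the range and threshold would need adjusting for such content.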
Upvotes: 1