parsing pdf with pdf-box shows unknown characters

Question

When I try to parse a pdf file with pdf box in java which is generated with cups pdf, showing junk characters. But it works perfectly with common pdfs I checked font cups pdf shows FreeMono_00.ttf (but i didn't se such a font anywhere) and working pdf shows ArialMT.

Anything I want to do differently for parsing the pdfs generated using cups-pdf.

below is the code I'm using for parsing.

parser = new PDFParser(new FileInputStream(File file));
parser.parse();
           COSDocument  cosDoc = parser.getDocument();
           PDFTextStripperpdfStripper = new PDFTextStripper();
          PDDocument  pdDoc = new PDDocument(cosDoc);
 String parsedText = pdfStripper.getText(pdDoc);

output is getting like this )LOH1DPHDVGW[W 6XEMHFWVXEMHFWVVDPSOH 0HVVDJHVHQGLQJGHWDLOVDORQJZLWKSULQWILOH 8VHU1DPH$EGXOUD]DN30 8VHU,'D#DFRP

just copy paste also gives like this

Ed Staub · Accepted Answer

I'm only repeating what I've read... I'm inexperienced here. If there were more mavens answering PDF/PDFBox questions, I'd wait for one to answer.

I believe the font either doesn't contain Unicode tables at all, or has been embedded in the document without Unicode tables. If the text seems to be a simple substitution cipher for a single given document, that would tend to confirm this.

If the font is embedded, I think sometimes only an extract of the glyphs you are actually using is embedded. That's likely here, since the font is not installed on the system (you said), and the original FreeMono font is large - over 4000 glyphs. In this case, I fear that the correspondence between character and glyph may be document-dependent - but I'm speculating.

parsing pdf with pdf-box shows unknown characters

Answers (1)

Related Questions