Stephan

Reputation: 8090

Java: how to write out PDF to a text file?

When I open a PDF file and write its content to a text file, the text comes out garbled. I think the encoding is the cause. As I understand it, the JVM sets the default character set to Cp1252 because I'm running on Windows XP. I tried changing the default character set with System.setProperty("file.encoding", "ISO-8859-1"); but it made no difference.

Any ideas?

Upvotes: 1

Views: 2774

Answers (7)

cemerick

Reputation: 5916

Our PDFTextStream library provides comprehensive support for diacriticals, as well as all character sets defined in the Unicode standard (including Chinese, Japanese, and Korean characters, in both horizontal and vertical writing modes). You might find that it extracts those diacriticals properly where other tools do not.

There are circumstances where a character, when extracted to text, will not appear to be the same as when it is displayed by a PDF reader like Acrobat -- this is most often the case when the text in question is rendered using an image-based font (which obviously doesn't convert directly to text, and would require an OCR process in order to derive the proper accented character(s)).

Upvotes: 1

FRotthowe

Reputation: 3662

Using the iText helper class PdfTextExtractor should work fine. Just check that you're using the right encoding when writing the file to disk:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "ISO-8859-1");
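To make the point about the output encoding concrete, here is a minimal sketch using only the standard library. It assumes the text has already been extracted from the PDF into a String. The class name ExplicitEncodingWriter, the method writeText, and the sample string are made up for illustration and are not part of iText. The idea is to pass the charset explicitly so the result does not depend on the JVM default.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingWriter {

    // Hypothetical helper: writes text using the given charset
    // instead of whatever the JVM default happens to be.
    static void writeText(Path target, String text, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(target), charset)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for text extracted from a PDF (assumption for this sketch).
        String extracted = "caf\u00e9 na\u00efve";

        Path latin1 = Files.createTempFile("extracted-latin1", ".txt");
        Path utf8 = Files.createTempFile("extracted-utf8", ".txt");

        writeText(latin1, extracted, StandardCharsets.ISO_8859_1);
        writeText(utf8, extracted, StandardCharsets.UTF_8);

        // The same characters occupy different byte counts per encoding.
        System.out.println("ISO-8859-1 bytes: " + Files.size(latin1));
        System.out.println("UTF-8 bytes: " + Files.size(utf8));
    }
}
```

Whatever program opens the resulting file has to read it with the same charset, otherwise the accented characters will still look wrong even though the file was written correctly.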

Upvotes: 2

peter.murray.rust

Reputation: 38073

You have to use a specialised package. Two that I have used are pdftotext (http://en.wikipedia.org/wiki/Pdftotext) and PDFBox (http://incubator.apache.org/pdfbox/). Even with a package you cannot always guarantee success, as some PDF-writing tools are of poor quality and generate poor PDF.

Upvotes: 1

Bobby

Reputation: 1611

iText may not be reading all the letters correctly because of the encoding used for the font. You could declare the font like this:

BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.EMBEDDED);

where BaseFont.CP1252 is the encoding used. Be advised that some fonts do not have support for all types of encodings.

Upvotes: 4

setzamora

Reputation: 3640

You can try JavaPDF. It has an API for you to do the job. You can invoke the method extractTextFromPage(int pageIndex) from the PDFReader class.

Upvotes: 2

i2ijeya

Reputation: 16430

iText is an API for creating PDFs from scratch, but to read and edit an existing file you can look at the following link: http://www.lowagie.com/iText/

Upvotes: 1

bschandramohan

Reputation: 2006

A PDF is a binary file, so you cannot read it as a text file. You will have to find a third-party library to read the PDF contents.

Upvotes: 0
