Elliott

Reputation: 5609

PDFBox removes spaces when text is not oriented correctly

The code below is a simple PDFBox demonstration that I found online:

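import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Package names below assume PDFBox 1.x, which matches the PDFParser(InputStream) constructor used here.
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFReader {
    public static void main(String[] args) {
        File file = new File("C:/my.pdf");
        PDDocument pdDoc = null;
        try {
            // Parse the file into a low-level COS document.
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);

            // Extract the text of pages 1 through 5.
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the document's resources even if extraction failed.
            if (pdDoc != null) {
                try {
                    pdDoc.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing fails.
                }
            }
        }
    }
}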

The code works fine unless the text in the PDF is rotated. For example, take a PDF file containing the text "The quick brown fox jumped over the lazy dog." If the text is upside down, it is extracted as:

Thequickbrownfoxjumpedoverthelazydog

And if the text is rotated 90 degrees, the output looks like this:

T
h
e
q
u
i
c
k  
etc. 

Is there a way to detect the orientation before stripping the text, and then adjust for it so that the spacing of the original document is preserved?

Upvotes: 0

Views: 623

Answers (1)

Elliott

Reputation: 5609

ItFreak's comment above pointed me to a Stack Overflow question with a comment that solved the problem. All that was necessary was to configure the PDFTextStripper as follows before calling getText:

pdfStripper.setSortByPosition(true);

Once I did this, the spacing was restored for both the upside-down and the 90-degree rotated text.
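
For anyone wondering why ordering by position can matter at all, here is a toy sketch of the general idea. It is not PDFBox's actual algorithm, and the `Glyph` class, `joinByPosition` method, and gap threshold are all hypothetical names and values made up for illustration. It only assumes that an extractor guesses word breaks from the gap between neighbouring glyphs, so the glyphs have to be visited in their on-page order for those gaps to mean anything.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox type. */
    public static final class Glyph {
        final String text;
        final double x;      // left edge
        final double y;      // line position
        final double width;

        public Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Sorts glyphs by line (y) and then by x, and inserts a space wherever the
     * horizontal gap to the previous glyph exceeds gapThreshold.
     */
    public static String joinByPosition(List<Glyph> glyphs, double gapThreshold) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));

        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                if (g.y != prev.y) {
                    out.append('\n');                 // new line of text
                } else if (g.x - (prev.x + prev.width) > gapThreshold) {
                    out.append(' ');                  // gap wide enough to be a word break
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Glyphs arrive in an order that does not match their position on the page.
        List<Glyph> glyphs = List.of(
                new Glyph("f", 30, 0, 5), new Glyph("o", 35, 0, 5), new Glyph("x", 40, 0, 5),
                new Glyph("T", 0, 0, 5), new Glyph("h", 5, 0, 5), new Glyph("e", 10, 0, 5));
        System.out.println(joinByPosition(glyphs, 2.0)); // prints "The fox"
    }
}
```

Without the sort step, the same gap test would run over the glyphs in arrival order and put the break in the wrong place or miss it entirely. In real code the only change needed is still the single setSortByPosition(true) call shown above.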

Upvotes: 1
