Reputation: 5609
The code below is a simple demonstration of PDFBox that I found on the internet:
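import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// PDFBox 1.x package names (in 2.x, PDFTextStripper lives in org.apache.pdfbox.text)
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFReader {
    public static void main(String[] args) {
        File file = new File("C:/my.pdf");
        PDDocument pdDoc = null;
        try {
            // Parse the file into a low-level COS document, then wrap it
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);

            // Extract the text of pages 1 through 5
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Closing the PDDocument also closes the underlying COSDocument
            if (pdDoc != null) {
                try { pdDoc.close(); } catch (IOException ignored) { }
            }
        }
    }
}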
The code works fine unless the text on the page is rotated. For example, take a PDF containing the text "The quick brown fox jumped over the lazy dog."
If the text is upside down, it is extracted with the spaces missing:
Thequickbrownfoxjumpedoverthelazydog
And if the text is rotated 90 degrees, it is extracted one character per line:
T
h
e
q
u
i
c
k
etc.
Is there a way to detect the orientation before stripping the text, and then adjust for it so that the spacing of the original document is preserved?
Upvotes: 0
Views: 623
Reputation: 5609
ItFreak's comment above pointed me to a Stack Overflow question with a comment that solved the problem. All that was necessary was to configure the PDFTextStripper as follows before calling getText:
pdfStripper.setSortByPosition(true);
Once I did this, the spacing was restored for both the upside-down and the 90-degree-rotated text.
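For anyone wondering what "sort by position" means in general: as far as I understand it, the text in a PDF is not necessarily stored in the order it appears on the page, so an extractor can reorder the pieces by their coordinates before joining them. The snippet below is only a toy sketch of that idea. It is not PDFBox code and does not claim to reproduce PDFBox's actual algorithm; the class `PositionSortSketch`, the `Fragment` type, and the coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: NOT PDFBox code and not its real algorithm.
final class PositionSortSketch {

    // Hypothetical text fragment with the page coordinates of its start
    // (y grows downward in this sketch).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders fragments top-to-bottom, then left-to-right, and joins them:
    // a space between fragments on the same line, a newline between lines.
    static String readingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : sorted) {
            if (previous != null) {
                out.append(previous.y == f.y ? ' ' : '\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in an arbitrary "stream" order.
        List<Fragment> streamOrder = List.of(
                new Fragment("fox", 200, 100),
                new Fragment("The", 50, 100),
                new Fragment("dog.", 50, 120),
                new Fragment("quick", 90, 100),
                new Fragment("brown", 140, 100));
        System.out.println(readingOrder(streamOrder));
    }
}
```

In real code you would not write any of this yourself; the single setSortByPosition(true) call above is all that was needed in my case.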
Upvotes: 1