DummerIdiot

Reputation: 11

Get vertical text as horizontal text with PDFBox

I want to use PDFBox to automatically extract and output the text of a PDF file. So far I can get all the text reasonably formatted by splitting the extracted string with .split("\\n|\\s{2,}"), but the PDF also contains vertical text, which my program outputs like this:

normal text, vertical text: 
Lorem ipsum dolor sit amet, c on s et e tu r sa d ip s ci n g el i tr . L or e m
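
For reference, here is a minimal sketch of the splitting step mentioned above. It assumes the text has already been extracted into a plain String; the class name SplitSketch, the method splitChunks, and the sample string are placeholders for illustration, not my real code. It only shows how the regex cuts on line breaks and on runs of two or more whitespace characters, which is why single-spaced fragments of vertical text stay glued together in one chunk.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

    // Splits extracted text on a line break or on two or more
    // consecutive whitespace characters, dropping empty pieces.
    static List<String> splitChunks(String extracted) {
        List<String> chunks = new ArrayList<>();
        for (String part : extracted.split("\\n|\\s{2,}")) {
            if (!part.isEmpty()) {
                chunks.add(part);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a double space and a line break act as separators.
        String sample = "Lorem ipsum  dolor sit\namet";
        for (String chunk : splitChunks(sample)) {
            System.out.println(chunk);
        }
    }
}
```

Single spaces are never treated as separators here, so the broken-up vertical text ("c on s et e tu r ...") is passed through unchanged.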

In Adobe Acrobat Pro, I can go to "Tags -> Document -> Sect" to see the Sect element for my vertical text; it contains a "P" tag and three "Span" tags that together make up the vertical text.

On top of that, the PDF doesn't seem to have any regions or an AcroForm defined. The goal is to be able to read any kind of PDF this way, no matter what content it has.

Is there a simple way to output the vertical text in a readable format?

Upvotes: 1

Views: 45
