IAmYourFaja

Reputation: 56914

Read PDF text and/or all content

I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.

For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:

while(page.hasMoreText())
    textList.add(page.nextTextChunk());

if(allMode)
    while(page.hasMoreImages())
        imageList.add(page.nextImage());

I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).

So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?

Upvotes: 1

Views: 3168

Answers (1)

mkl

Reputation: 95918

To expound on some aspects of why @markStephens points you towards resources giving some background on PDF:

In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.

Your definition, "visible" as if a human being was reading the PDF, is not yet very well-defined:

  • Is text 1 pt in size visible? When zooming in, a human can read it; in standard magnification not, though. Which size would be the limit?

  • Is text in RGB (128, 129, 128) on a background of (128, 128, 128) visible? How different do the colors have to be?

  • Is text displayed in some white-noise pattern on a background of some other white-noise pattern visible? How different do the patterns have to be?

  • Is text that is only partially on-screen visible? If yes, is one visible pixel enough? And what about a character 'i' at a giant size where the visible page area fits into the dot on the letter?

  • What about text covered by some annotation which can easily be moved, probably even by some automatically executed JavaScript code in the file?

  • What about text in some optional content group only visible when printing?

  • ...

I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting the crop box. In the case of images with added, invisible OCR'ed text, extracting that text is generally desired anyway.
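To illustrate, plain text extraction with PDFBox (a sketch, assuming the 2.x API) pulls everything the content streams contain, visible or not:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class TextModeExtractor {
    // Extracts all text found in the content streams, regardless of
    // visibility tricks (tiny sizes, matching colors, covered text,
    // optional content groups).
    public static String extractText(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // heuristic reading order
            return stripper.getText(doc);
        }
    }
}
```

Note that PDFTextStripper will emit 1 pt white-on-white text just as readily as normal body text.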

For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:

PDF (in general) does not know about paragraphs, just groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task that cannot be guaranteed to work properly, as heuristics are at work. If, furthermore, you have multi-column text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or one column with an integrated image), you can count on recognition of the text flow, let alone of text elements like paragraphs, sections, etc., to fail miserably.

If your PDFs are either properly tagged or all generated by a tool chain for which patterns in the created PDF content streams betray text structures, you may be more lucky. In case of the latter, though, your solution would have to be custom-made for that tool chain.

but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).

There you point towards another point of interest: PDFs can be marked so that text extraction is forbidden while the document can otherwise be displayed by anyone. Technically, PDFs marked like that can be handled just like documents without that mark, with only one extra decoding step (essentially they are encrypted with a publicly known password), but doing so clearly acts against the declared intention of the author and violates their copyright.
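With PDFBox you can at least honor that mark: after opening the document, its access permissions tell you whether the author allowed extraction. A minimal sketch (again assuming the 2.x API):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;

public class PermissionCheck {
    // Returns true only if the document's permissions allow text extraction.
    public static boolean mayExtract(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            AccessPermission ap = doc.getCurrentAccessPermission();
            return ap.canExtractContent();
        }
    }
}
```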

So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?

As long as you expect 100% accuracy for generic input, you should reconsider your architecture.

If, on the other hand, the PDFs are all you have and a solution that is as effective as possible is acceptable, there are multiple libraries available to you, iText and PDFBox to name but two. Which is best for you depends on further factors, e.g. on whether you need a generic solution or all PDFs are created by a tool chain as above.

In any case you'll have to do some programming yourself, though, to fine-tune them for your use case.
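For the ALL mode, for instance, PDFBox exposes the image XObjects of each page through its resources. A sketch along those lines (PDFBox 2.x; inline images and images nested inside form XObjects would need extra handling):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class AllModeExtractor {
    // Collects the image XObjects referenced by every page's resources.
    public static List<BufferedImage> extractImages(File pdf) throws IOException {
        List<BufferedImage> images = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            for (PDPage page : doc.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getXObjectNames()) {
                    PDXObject xObject = resources.getXObject(name);
                    if (xObject instanceof PDImageXObject) {
                        images.add(((PDImageXObject) xObject).getImage());
                    }
                }
            }
        }
        return images;
    }
}
```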

Upvotes: 1
