Christian Eric Paran

Reputation: 1010

iText reading multicolumned PDF document


I use iText to extract the content of a PDF page into a string variable, and then clean up the extracted text like this:

reader = new PdfReader(getResources().openRawResource(R.raw.resume1));
original_content = PdfTextExtractor.getTextFromPage(reader, 2);
String sub_content = original_content.trim().replaceAll(" {2,}", " ");
sub_content = sub_content.trim().replaceAll("\n ", "\n");
sub_content = sub_content.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 "); 

This works if the document has only one column. If the document has multiple columns, the text is extracted one page-wide line at a time, so each extracted line combines the left and right columns.
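
To make the cleanup step concrete, here is a minimal sketch that applies the same three replacements to a made-up string, without iText. The names `CleanupSketch` and `cleanUp` and the sample text are placeholders for illustration only:

```java
class CleanupSketch {
    // Applies the same three replacements as the snippet above.
    static String cleanUp(String original) {
        // 1. Collapse runs of two or more spaces into a single space.
        String sub = original.trim().replaceAll(" {2,}", " ");
        // 2. Drop a single space that directly follows a line break.
        sub = sub.trim().replaceAll("\n ", "\n");
        // 3. Join a line with the next one unless it ends with a period
        //    or the next line starts with a non-word character.
        sub = sub.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
        return sub;
    }

    public static void main(String[] args) {
        // Made-up sample: wrapped lines are joined, the break after "test." is kept.
        System.out.println(cleanUp("Hello  world\n this is\na test.\nNext line"));
    }
}
```

Because step 3 joins lines purely by their text, it cannot undo a left column and a right column that were already merged into one line during extraction.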

I am using this as a sample PDF; it is from a START QA document.

How can I read a multicolumned PDF document?

Upvotes: 0

Views: 4166

Answers (1)

mkl

Reputation: 95918

There are two different approaches to this problem, and which one to choose depends on the PDF itself.

  1. If the strings in the page content of the PDF in question are already in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you use, explicitly use the SimpleTextExtractionStrategy; in your case:

    original_content = PdfTextExtractor.getTextFromPage(reader, 2, new SimpleTextExtractionStrategy());
    
  2. If the strings in the page content of the PDF in question are not in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you use, explicitly wrap one such strategy in a FilteredTextRenderListener restricting it to receive text for the area of a single column only; in your case:

    // Left and right halves of a 612 x 792 pt (US Letter) page;
    // adjust the coordinates to the actual page size and column boundary.
    Rectangle left = new Rectangle(0, 0, 306, 792);
    Rectangle right = new Rectangle(306, 0, 612, 792);
    RenderFilter leftFilter = new RegionTextRenderFilter(left);
    RenderFilter rightFilter = new RegionTextRenderFilter(right);
    [...]
    // Extract the left column first ...
    TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), leftFilter);
    original_content = PdfTextExtractor.getTextFromPage(reader, 2, strategy);
    original_content += " ";
    // ... then append the right column, using a fresh strategy instance.
    strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), rightFilter);
    original_content += PdfTextExtractor.getTextFromPage(reader, 2, strategy);
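
To illustrate why restricting extraction to one region at a time helps, here is a plain Java sketch of the idea that does not use iText at all. `ColumnSplitSketch`, `Chunk`, `extractRegion` and `extractTwoColumns` are hypothetical names, and the real strategies are more involved than this; the sketch only mimics "keep the text inside a region, order it by position, then concatenate the columns", with the boundary at x = 306 assumed from the rectangles above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnSplitSketch {
    // Hypothetical stand-in for a positioned piece of text; not an iText class.
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Keeps chunks starting in [minX, maxX), orders them top to bottom
    // (PDF y grows upwards) and then left to right, and joins them with spaces.
    static String extractRegion(List<Chunk> chunks, float minX, float maxX) {
        List<Chunk> inRegion = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.x >= minX && c.x < maxX) {
                inRegion.add(c);
            }
        }
        inRegion.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : inRegion) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Left column first, then right column; 306 is the assumed column boundary.
    static String extractTwoColumns(List<Chunk> chunks) {
        return extractRegion(chunks, 0, 306) + " " + extractRegion(chunks, 306, 612);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk(72, 700, "Left line 1"), new Chunk(320, 700, "Right line 1"),
                new Chunk(72, 680, "Left line 2"), new Chunk(320, 680, "Right line 2"));
        // prints "Left line 1 Left line 2 Right line 1 Right line 2"
        System.out.println(extractTwoColumns(chunks));
    }
}
```

Sorting the whole page by position without the region filter would instead interleave the columns line by line, which is the behaviour described in the question.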
    

Upvotes: 3
