iTextSharp Read Text From Single Layer of PDF

Question

Currently I am using a custom LocationTextExtractionStrategy to extract text from a PDF that returns a TextRenderInfo[]. I would like to be able to determine if a TextRenderInfo object (or the PdfString it holds) appears in a specific layer. I am not sure if this is possible. To get the layers in a PDF, I am using:

Dictionary<string, PdfLayer> layers;
using (var pdfReader = new PdfReader(src))
{
    var newSrc = Path.Combine(["new file location"]);
    using (var stream = new FileStream(newSrc, FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(pdfReader, stream);
        layers = stamper.GetPdfLayers();
        stamper.Close();
    }
    pdfReader.Close();
    src = newSrc;
}

To extract the text, I am using:

var textExtractor = new TextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, pdfPageNum, textExtractor);
List<TextRenderInfo> results = textExtractor.Results;

Is there any way that I can check whether the individual TextRenderInfo results exist within the layers obtained in the first code snippet? Any help would be much appreciated.

blagae · Accepted Answer

It is possible to get the contents from a single layer, but you'll have to jump through a few hoops to work it out. Specifically, you will have to recreate some of the logic that is provided by the PdfTextExtractor and PdfReaderContentParser.

public static String GetText(PdfReader reader, int pageNumber, int streamNumber) {
    var strategy = new LocationTextExtractionStrategy();
    var processor = new PdfContentStreamProcessor(strategy);

    // fonts and other resources are resolved through the page's resource dictionary
    var pageDic = reader.GetPageN(pageNumber);
    var resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);

    // assuming you still need to extract the page bytes
    byte[] contents = GetContentBytesForPageStream(reader, pageNumber, streamNumber);

    processor.ProcessContent(contents, resourcesDic);
    return strategy.GetResultantText();
}

public static byte[] GetContentBytesForPageStream(PdfReader reader, int pageNumber, int streamNumber) {
    PdfDictionary pageDictionary = reader.GetPageN(pageNumber);
    PdfObject contentObject = pageDictionary.Get(PdfName.CONTENTS);
    if (contentObject == null)
        return new byte[0];

    byte[] contentBytes = GetContentBytesFromContentObject(contentObject, streamNumber);
    return contentBytes;
}

public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
    // copy-paste logic from
    // ContentByteUtils.GetContentBytesFromContentObject(contentObject);
    // but in case PdfObject.ARRAY: only select the streamNumber you require
}
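
As a rough sketch, the stub could be filled in along these lines. It is modeled on the switch in ContentByteUtils.GetContentBytesFromContentObject as found in iTextSharp 5.x, so check it against the source of the version you use. The class name SingleStreamContent is hypothetical, and the code assumes that streamNumber is a zero-based index into the page's /Contents array and that the layer you want really does sit in its own content stream, which a PDF producer is not required to do.

```csharp
using System;
using iTextSharp.text.pdf;

public static class SingleStreamContent {
    public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
        switch (contentObject.Type) {
            case PdfObject.INDIRECT:
                // Resolve the reference, then inspect the object it points to.
                PdfObject direct = PdfReader.GetPdfObject(contentObject);
                return GetContentBytesFromContentObject(direct, streamNumber);

            case PdfObject.STREAM:
                // The page has a single content stream, so there is nothing to select.
                return PdfReader.GetStreamBytes((PRStream)contentObject);

            case PdfObject.ARRAY:
                PdfArray contentArray = (PdfArray)contentObject;
                if (streamNumber < 0 || streamNumber >= contentArray.Size)
                    throw new ArgumentOutOfRangeException("streamNumber");
                // Unlike the library version, do not stitch all streams together:
                // decode only the requested one (assumption: zero-based index).
                PdfObject element = contentArray.GetDirectObject(streamNumber);
                return PdfReader.GetStreamBytes((PRStream)element);

            default:
                throw new InvalidOperationException(
                    "Unable to handle Content of type " + contentObject.GetType());
        }
    }
}
```

Keep in mind that the streams in a /Contents array are meant to be read as one concatenated stream, so a stream processed on its own may depend on graphics or text state that an earlier stream set up.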

If you're specifically looking to just use PdfTextExtractor or PdfReaderContentParser, and ask the returned TextRenderInfo for the layer it's on, then I'm not sure it will be easily possible. There are a number of problems with that:

TextRenderInfo doesn't store that information, so you'd have to subclass it (which is possible).
You'd have to rewrite the logic that creates the TextRenderInfo objects. This is possible by registering custom IContentOperator objects for all text operators (Tj, TJ, ' and ") with the PdfTextExtractor or PdfReaderContentParser.
The hardest part is that you have already lost layer information in ContentByteUtils.GetContentBytesFromContentObject, so you'd need to retain that somehow, which creates its own set of problems.
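
To illustrate only the operator-registration step from the list above, here is a minimal sketch. The names InterceptingOperator and OperatorHooks are hypothetical. It assumes an iTextSharp 5.x version in which PdfContentStreamProcessor.RegisterContentOperator replaces an existing operator and returns the one previously registered; older versions may throw instead. It does not solve the layer problem: the callback only learns which text operator fired, and attaching layer information is still up to you.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical wrapper: notifies a callback, then delegates to the operator it replaced.
public class InterceptingOperator : IContentOperator {
    private readonly IContentOperator inner;
    private readonly Action<string> onInvoke;

    public InterceptingOperator(IContentOperator inner, Action<string> onInvoke) {
        this.inner = inner;
        this.onInvoke = onInvoke;
    }

    public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands) {
        onInvoke(oper.ToString());          // e.g. "Tj" or "TJ"
        if (inner != null)
            inner.Invoke(processor, oper, operands);
    }
}

public static class OperatorHooks {
    // Wraps the four text-showing operators on a processor you created yourself.
    public static void WrapTextOperators(PdfContentStreamProcessor processor, Action<string> onInvoke) {
        foreach (string op in new[] { "Tj", "TJ", "'", "\"" }) {
            // Assumption: the call returns the operator previously bound to this name.
            IContentOperator original = processor.RegisterContentOperator(op, null);
            processor.RegisterContentOperator(op, new InterceptingOperator(original, onInvoke));
        }
    }
}
```

You would call OperatorHooks.WrapTextOperators(processor, name => { /* record state here */ }) on the PdfContentStreamProcessor created in GetText, before calling ProcessContent.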
