Bilal Malik

Reputation: 79

How to get tags from a PDF document in C#

I am using iTextSharp to read a PDF document, and reading it works. Now I want to get the tags from the document, but I don't know how to do that with iTextSharp.

The code is given below:

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        var result = pdfText(@"C:\Users\Purelogics\Desktop\tranfer\tagged.pdf");
    }

    public static string pdfText(string path)
    {
        PdfReader reader = new PdfReader(path);
        // This returns true, which means the document is tagged
        bool isTagged = reader.IsTagged();
        var metadata = reader.Metadata;
        IList<Dictionary<string, object>> bookmarks = SimpleBookmark.GetBookmark(reader);
        string text = string.Empty;
        var title = reader.Info["Title"];
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Note: GetPdfObject takes an indirect object number, not a page number
            var object1 = reader.GetPdfObject(page);
            text += PdfTextExtractor.GetTextFromPage(reader, page);
        }
        reader.Close();
        return text;
    }
}

Upvotes: 1

Views: 2420

Answers (1)

Joris Schellekens

Reputation: 9012

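It depends on what exactly you want to do with the tags. Let's assume you want to extract everything tagged as /P (paragraphs). The snippets below use the iText 7 API in Java; iText 7 for .NET exposes the same classes with PascalCase method names (e.g. GetStructTreeRoot).

First, you need to get the structure tree root of the document: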

File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
IStructureNode root = pdfDocument.getStructTreeRoot();

Now that you have the structure tree root, you can crawl the tree, either with an explicit stack or with recursion. This example uses recursion:

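private Set<PdfNameMCIDGroup> find(PdfDocument pdfDocument, IStructureNode node){
    Set<PdfNameMCIDGroup> out = new HashSet<>();

    // markedRoles is a Set<PdfName> of the roles to extract (e.g. PdfName.P), defined elsewhere
    PdfName role = node.getRole();
    if(markedRoles.contains(role)) {
        out.add(mark(pdfDocument, node));
    } else if(node.getKids() != null) {
        // leaf nodes (marked content references) may report no kids, so guard against null
        for(IStructureNode kid : node.getKids())
            if(kid != null)
                out.addAll(find(pdfDocument, kid));
    }

    return out;
}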

And this is what the mark method looks like

private PdfNameMCIDGroup mark(PdfDocument pdfDocument, IStructureNode node){
    // PdfNameMCIDGroup is a custom holder class with public mcids and pageNrs collections
    PdfNameMCIDGroup out = new PdfNameMCIDGroup(0);

    // collect every marked content reference (leaf) below the matched node
    Set<PdfMcr> leaves = new HashSet<>();
    Stack<IStructureNode> stk = new Stack<>();
    stk.push(node);
    while(!stk.isEmpty()){
        IStructureNode tmp = stk.pop();
        if(tmp instanceof PdfMcr)
            leaves.add((PdfMcr) tmp);
        else if(tmp.getKids() != null)
            for(IStructureNode kid : tmp.getKids())
                if(kid != null)
                    stk.push(kid);
    }

    // record the MCID and page number of each leaf
    for(PdfMcr mcr : leaves){
        int mcid = mcr.getMcid();
        int pageNr = pdfDocument.getPageNumber(mcr.getPageObject());
        out.mcids.add(mcid);
        out.pageNrs.add(pageNr);
    }

    return out;
}
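
The answer uses PdfNameMCIDGroup without showing it. A minimal sketch of what such a holder could look like follows, assuming it only needs to pair each MCID with its page number. The class body, the meaning of the constructor argument, and the contains helper are all assumptions for illustration, not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder class; the answer refers to it but does not define it.
class PdfNameMCIDGroup {
    // Arbitrary identifier supplied by the caller (assumed meaning of the constructor argument).
    final int id;
    // Marked content IDs of the collected leaves.
    final List<Integer> mcids = new ArrayList<>();
    // Page number for each entry in mcids (same index).
    final List<Integer> pageNrs = new ArrayList<>();

    PdfNameMCIDGroup(int id) {
        this.id = id;
    }

    // True if the given (page, MCID) pair was recorded in this group.
    boolean contains(int pageNr, int mcid) {
        for (int i = 0; i < mcids.size(); i++) {
            if (mcids.get(i) == mcid && pageNrs.get(i) == pageNr) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the page number next to each MCID matters because the same MCID value can generally reappear on a different page.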

The idea behind these methods is that find traverses the tree hierarchy, and mark handles a node as soon as it matches one of the roles.
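
To see the division of labour between find and mark without a PDF at hand, here is a small self-contained model of the same traversal. The Node class is a hypothetical stand-in for a structure tree node, not an iText type, and roles are plain strings rather than PdfName values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

class TagTreeSketch {
    // Hypothetical stand-in for a structure node: either a role with kids, or a leaf with an MCID.
    static class Node {
        final String role;                 // e.g. "Document", "P"; null for leaves
        final int mcid;                    // -1 for non-leaf nodes
        final List<Node> kids = new ArrayList<>();

        Node(String role, Node... kids) {
            this.role = role;
            this.mcid = -1;
            this.kids.addAll(List.of(kids));
        }

        Node(int mcid) {
            this.role = null;
            this.mcid = mcid;
        }

        boolean isLeaf() {
            return mcid >= 0;
        }
    }

    // find: recurse until a node with a wanted role is reached, then hand it to mark.
    static List<List<Integer>> find(Node node, Set<String> wantedRoles) {
        List<List<Integer>> out = new ArrayList<>();
        if (node.role != null && wantedRoles.contains(node.role)) {
            out.add(mark(node));
        } else {
            for (Node kid : node.kids) {
                out.addAll(find(kid, wantedRoles));
            }
        }
        return out;
    }

    // mark: collect the MCIDs of every leaf below the matched node, using an explicit stack.
    static List<Integer> mark(Node node) {
        List<Integer> mcids = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(node);
        while (!stack.isEmpty()) {
            Node tmp = stack.pop();
            if (tmp.isLeaf()) {
                mcids.add(tmp.mcid);
            } else {
                // push kids in reverse so they are popped in document order
                for (int i = tmp.kids.size() - 1; i >= 0; i--) {
                    stack.push(tmp.kids.get(i));
                }
            }
        }
        return mcids;
    }

    public static void main(String[] args) {
        Node root = new Node("Document",
                new Node("H1", new Node(0)),
                new Node("P", new Node(1), new Node("Span", new Node(2))),
                new Node("P", new Node(3)));
        System.out.println(find(root, Set.of("P"))); // prints [[1, 2], [3]]
    }
}
```

Each inner list corresponds to one matched paragraph; the heading's MCID 0 is skipped because its role is not in the wanted set.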

This code gets us halfway to a solution. We now have the marked content IDs (MCIDs) of whatever content is tagged as /P.

We now need to extract the rendering instructions that match these IDs.

For this, you need to write your own IEventListener. An instance of this class can be passed to a PdfCanvasProcessor, and will then be notified every time the parser has finished processing an instruction.

@Override
public void eventOccurred(IEventData iEventData, EventType eventType) {
    if(eventType == EventType.RENDER_TEXT)
    {
        TextRenderInfo tri = (TextRenderInfo) iEventData;
        int mcID = tri.getMcid();
        // this is where you can do something with it,
        // e.g. keep tri.getText() only if mcID is one of the IDs collected above
    }
}
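
One possible way to "do something with it" is to bucket the text by MCID as the events arrive and then read back only the wanted IDs. The sketch below is plain Java with no iText dependency. McidTextCollector and its methods are hypothetical names, and it assumes a negative MCID means the text is not inside marked content.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: one instance per page, fed from the listener's eventOccurred method.
class McidTextCollector {
    // Text accumulated per MCID, in the order the MCIDs first appear.
    private final Map<Integer, StringBuilder> textByMcid = new LinkedHashMap<>();

    // Intended to be called with tri.getMcid() and tri.getText() for each text event.
    void accept(int mcid, String text) {
        if (mcid < 0) {
            return; // assumption: negative means "no marked content"
        }
        textByMcid.computeIfAbsent(mcid, k -> new StringBuilder()).append(text);
    }

    // Join the text of the wanted MCIDs, one line per MCID.
    String textFor(Set<Integer> wantedMcids) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Integer, StringBuilder> e : textByMcid.entrySet()) {
            if (wantedMcids.contains(e.getKey())) {
                out.append(e.getValue()).append('\n');
            }
        }
        return out.toString();
    }
}
```

Using one collector per page keeps MCIDs from different pages apart, which matches the page numbers recorded by mark.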

Upvotes: 2
