Reputation: 79
I am using iTextSharp to read a PDF document, and reading it works. Now I want to get the tags from the PDF document, but I don't know how to do that with iTextSharp.
My code is given below:
class Program
{
    static void Main(string[] args)
    {
        var result = pdfText(@"C:\Users\Purelogics\Desktop\tranfer\tagged.pdf");
    }

    public static string pdfText(string path)
    {
        PdfReader reader = new PdfReader(path);
        // This returns true, which means the document is tagged
        bool isTagged = reader.IsTagged();
        var metadeta = reader.Metadata;
        IList<Dictionary<string, object>> bookmarks = SimpleBookmark.GetBookmark(reader);
        string text = string.Empty;
        var title = reader.Info["Title"];
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            var object1 = reader.GetPdfObject(page);
            text += PdfTextExtractor.GetTextFromPage(reader, page);
        }
        reader.Close();
        return text;
    }
}
Upvotes: 1
Views: 2420
Reputation: 9012
It depends on what exactly you want to do with the tags. The snippets below use the iText 7 Java API.
Let's assume you want to extract everything tagged as /P (paragraphs).
First, you need to get the structure tree root of the document:
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
IStructureNode root = pdfDocument.getStructTreeRoot();
Now that you have the structure tree root, you can crawl the tree, either with an explicit stack or with recursion. In this example, I am using recursion.
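// markedRoles is assumed to be a collection of the roles to extract, e.g. PdfName.P
private Set<PdfNameMCIDGroup> find(PdfDocument pdfDocument, IStructureNode node){
    Set<PdfNameMCIDGroup> out = new HashSet<>();
    PdfName role = node.getRole();
    if(markedRoles.contains(role)) {
        // matching role: collect everything beneath this node
        out.add(mark(pdfDocument, node));
    } else {
        // leaf nodes may report no kids at all, and a kid entry may be null
        List<IStructureNode> kids = node.getKids();
        if(kids != null)
            for(IStructureNode kid : kids)
                if(kid != null)
                    out.addAll(find(pdfDocument, kid));
    }
    return out;
}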
And this is what the mark method looks like:
private PdfNameMCIDGroup mark(PdfDocument pdfDocument, IStructureNode node){
    PdfNameMCIDGroup out = new PdfNameMCIDGroup(0);
    // collect every marked-content reference (leaf) below node
    Set<PdfMcr> leaves = new HashSet<>();
    Stack<IStructureNode> stk = new Stack<>();
    stk.push(node);
    while(!stk.isEmpty()){
        IStructureNode tmp = stk.pop();
        if(tmp instanceof PdfMcr)
            leaves.add((PdfMcr) tmp);
        else if(tmp.getKids() != null)
            for(IStructureNode kid : tmp.getKids())
                if(kid != null)   // skip kids that could not be resolved
                    stk.push(kid);
    }
    // record the MCID and page number of each leaf
    for(PdfMcr mcr : leaves){
        int mcid = mcr.getMcid();
        int pageNr = pdfDocument.getPageNumber(mcr.getPageObject());
        out.mcids.add(mcid);
        out.pageNrs.add(pageNr);
    }
    return out;
}
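The PdfNameMCIDGroup type is not an iText class and is not shown in this answer. A minimal sketch of what such a holder could look like is given below; the field names mcids and pageNrs and the one-int constructor are taken from the code above, while the meaning of the constructor argument and the choice of sets are assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical holder class, not part of iText. Only the constructor shape
// and the two field names are known from the answer's code.
class PdfNameMCIDGroup {
    // Assumption: the int argument is a caller-chosen identifier for the group.
    final int id;
    // Marked-content IDs found beneath one matched structure element.
    final Set<Integer> mcids = new LinkedHashSet<>();
    // Page numbers on which that marked content appears.
    final Set<Integer> pageNrs = new LinkedHashSet<>();

    PdfNameMCIDGroup(int id) {
        this.id = id;
    }
}
```

One thing to keep in mind with this layout: MCIDs are only unique within a single page's content stream, so if a tagged element spans several pages, two separate sets no longer tell you which MCID belongs to which page. Storing (page number, MCID) pairs instead would avoid that.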
The idea behind these methods is that find traverses the tree hierarchy, and mark handles a node as soon as it matches one of the roles.
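To see this division of labour without needing a PDF, here is a small sketch on a stand-in tree. ToyNode and ToyTagWalker are hypothetical types made up for illustration; they only mimic the shape of a structure tree (inner nodes with a role, leaves with an MCID) and do not use iText.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Stand-in for a structure-tree node; hypothetical, not an iText type.
class ToyNode {
    final String role;   // e.g. "Document" or "P"; null for a leaf
    final int mcid;      // marked-content ID for a leaf, -1 otherwise
    final List<ToyNode> kids = new ArrayList<>();

    ToyNode(String role, int mcid, ToyNode... kids) {
        this.role = role;
        this.mcid = mcid;
        this.kids.addAll(Arrays.asList(kids));
    }

    boolean isLeaf() {
        return mcid >= 0;
    }
}

class ToyTagWalker {
    // "find": recurse until a node has the wanted role, then hand it to mark().
    static List<List<Integer>> find(ToyNode node, String wantedRole) {
        List<List<Integer>> out = new ArrayList<>();
        if (wantedRole.equals(node.role)) {
            out.add(mark(node));
        } else {
            for (ToyNode kid : node.kids) {
                out.addAll(find(kid, wantedRole));
            }
        }
        return out;
    }

    // "mark": gather the MCIDs of all leaves below the matched node
    // with an explicit stack instead of recursion.
    static List<Integer> mark(ToyNode node) {
        List<Integer> mcids = new ArrayList<>();
        Deque<ToyNode> stack = new ArrayDeque<>();
        stack.push(node);
        while (!stack.isEmpty()) {
            ToyNode tmp = stack.pop();
            if (tmp.isLeaf()) {
                mcids.add(tmp.mcid);
            } else {
                for (ToyNode kid : tmp.kids) {
                    stack.push(kid);
                }
            }
        }
        // The stack visits siblings in reverse, so sort for a stable order.
        Collections.sort(mcids);
        return mcids;
    }
}
```

For a tree with a heading holding MCID 0, one paragraph holding MCIDs 1 and 2 (the latter nested in a span), and a second paragraph holding MCID 3, find with the role "P" yields two groups: one with 1 and 2, one with 3. The heading's content is never collected because find does not match it and mark is never called on it.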
This code gets us halfway to a solution.
We now have the marked-content IDs (MCIDs) of whatever content is tagged as /P.
Next, we need to extract the rendering instructions that match these IDs.
For this, you need to write your own IEventListener. An instance of this class can be passed to a PdfCanvasProcessor, and it will then be notified every time the parser has finished processing an instruction.
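// IEventListener also requires getSupportedEvents(); it is not shown here
public void eventOccurred(IEventData iEventData, EventType eventType) {
    if(eventType == EventType.RENDER_TEXT)
    {
        TextRenderInfo tri = (TextRenderInfo) iEventData;
        // MCID of the marked-content sequence this text chunk belongs to
        int mcID = tri.getMcid();
        // this is where you can do something with it
    }
}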
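As for the "do something with it" part, one option is to keep the text only when its MCID is among the ones collected from the structure tree. The sketch below shows that filtering step in plain Java; McidTextCollector and its methods are hypothetical names, not part of iText.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, independent of iText: keeps text only for wanted MCIDs.
class McidTextCollector {
    private final Set<Integer> wanted;
    // Text accumulated per MCID, in the order the MCIDs were first seen.
    private final Map<Integer, StringBuilder> textByMcid = new LinkedHashMap<>();

    McidTextCollector(Set<Integer> wanted) {
        this.wanted = wanted;
    }

    // Call once per text-render event with that chunk's MCID and text.
    void accept(int mcid, String text) {
        if (!wanted.contains(mcid)) {
            return; // content outside the tags of interest
        }
        textByMcid.computeIfAbsent(mcid, k -> new StringBuilder()).append(text);
    }

    // Returns the collected text for one MCID, or "" if none was seen.
    String textFor(int mcid) {
        StringBuilder sb = textByMcid.get(mcid);
        return sb == null ? "" : sb.toString();
    }
}
```

Inside eventOccurred you would then call something like collector.accept(tri.getMcid(), tri.getText()). Because MCIDs are only unique per page, it is probably simplest to use one collector per page, filled with the MCIDs that were recorded for that page.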
Upvotes: 2