Reputation: 1221
We're using the PDFNet library to extract the contents of a PDF file. One of the things we need to do is extract the URLs in the PDF. Unfortunately, as you scan through the elements in the file, each URL comes back in pieces, and it's not always clear which pieces belong together.
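For reference, here is a simplified sketch of the kind of element scan we mean (setup is omitted; it assumes the usual pdftron.PDF usings and that page came from doc.GetPage(...)). Each text run comes back as a separate element, so a URL printed on the page arrives in fragments:
// Simplified sketch: reads the text elements on a page with ElementReader.
// Assumes: using pdftron.PDF; and that 'page' was obtained from doc.GetPage(...).
ElementReader reader = new ElementReader();
reader.Begin(page);
Element element;
while ((element = reader.Next()) != null) {
    if (element.GetType() == Element.Type.e_text) {
        // Each text run is a separate piece; a single URL can be split across several.
        Console.Write(element.GetTextString());
    }
}
reader.End();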
What is the best way to get complete URLs from PDFNet?
Upvotes: 1
Views: 553
Reputation: 1221
Links are stored on the pages as annotations. You can use code like the following to get the URI from an annotation. The try/catch block is there because when a value is missing, the lookup still returns an Obj, but any method you call on it will throw.
Also, be aware that not everything that looks like a link is the same. We created two PDFs from the same Word file: the first with Print to PDF, the second from within Acrobat. The links in both files work fine in Acrobat Reader, but only the second file has link annotations that PDFNet can see.
Page page = doc.GetPage(1);
for (int i = 0; i < page.GetNumAnnots(); i++) {
    Annot annot = page.GetAnnot(i);
    if (!annot.IsValid())
        continue;
    // The annotation's underlying SDF/Cos dictionary holds the link's action.
    var sdf = annot.GetSDFObj();
    string uri = ParseURI(sdf);
    Console.WriteLine(uri);
}
private string ParseURI(pdftron.SDF.Obj obj) {
    try {
        if (obj.IsDict()) {
            // A link annotation keeps its action in the "A" entry;
            // for URI actions, the target address is in the action's "URI" entry.
            var aDictionary = obj.Find("A").Value();
            var uri = aDictionary.Find("URI").Value();
            return uri.GetAsPDFText();
        }
    } catch (Exception) {
        return null;
    }
    return null;
}
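For completeness, a minimal driver along these lines walks every page of a document and prints each link URI, reusing the ParseURI helper above. The input file name is a placeholder, and PDFNet.Initialize also has an overload that takes a license key:
// Minimal driver sketch. Assumes: using pdftron; using pdftron.PDF;
// "input.pdf" is a placeholder path.
PDFNet.Initialize();
using (PDFDoc doc = new PDFDoc("input.pdf")) {
    doc.InitSecurityHandler();
    for (int p = 1; p <= doc.GetPageCount(); p++) {
        Page page = doc.GetPage(p);
        for (int i = 0; i < page.GetNumAnnots(); i++) {
            Annot annot = page.GetAnnot(i);
            if (!annot.IsValid())
                continue;
            string uri = ParseURI(annot.GetSDFObj());
            if (uri != null)
                Console.WriteLine(uri);
        }
    }
}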
Upvotes: 1