Reputation: 1570
I am using the following code to parse text from a PDF using the .NET version of PDFBox.
Imports org.apache.pdfbox.pdmodel
Imports org.apache.pdfbox.util

Private Shared Function parseUsingPDFBox(ByVal input As String) As String
    Dim doc As PDDocument = Nothing
    Try
        doc = PDDocument.load(input)
        Dim stripper As New PDFTextStripper()
        Return stripper.getText(doc)
    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try
End Function
http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-vb
The code is extracting the plain visible text, but is not extracting the comments.
I have tried using FDFAnnotation.ToString(), but the compiler warns that ToString() is ambiguous...
doc = PDDocument.load(strFilename)
Dim stripper As New FDFAnnotationText
Return stripper.tostring(doc)
I have also tried iTextSharp, and with it I can extract the comments via PdfName.ANNOTS, but I would prefer to stick with PDFBox.
My preferred language is VB, but I am happy to accept answers in C# too.
Upvotes: 1
Views: 608
Reputation: 95928
I assume by "comments" you mean text annotations with Name value Comment. The following code outputs the Contents of all text annotations. If you mean a different annotation kind, you might have to adapt it:
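Imports org.apache.pdfbox.pdmodel
Imports org.apache.pdfbox.pdmodel.interactive.annotation

Dim doc As PDDocument = PDDocument.loadNonSeq(New java.io.File("..."), Nothing)
Try
    Dim pages As java.util.List = doc.getDocumentCatalog().getAllPages()
    For i = 0 To pages.size() - 1
        ' The Java lists are untyped, so cast explicitly (required under Option Strict On)
        Dim page As PDPage = DirectCast(pages.get(i), PDPage)
        Dim annotations As java.util.List = page.getAnnotations()
        For j = 0 To annotations.size() - 1
            Dim annotation As PDAnnotation = DirectCast(annotations.get(j), PDAnnotation)
            ' "Text" is the subtype of text (sticky note) annotations
            If annotation.getSubtype() = "Text" Then
                Console.WriteLine("{0}-{1} : {2}", i, j, annotation.getContents())
            End If
        Next
    Next
Finally
    ' Close the document even if reading the annotations fails
    doc.close()
End Try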
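If the comments turn out to be a different annotation kind (highlights, free text, and so on), one possible adaptation is to test for the markup annotation base class instead of comparing the subtype string. The following is only a sketch, not tested against every PDFBox build. It assumes the PDFBox 1.8.x .NET (IKVM) assemblies, and it assumes that PDFBox instantiates these annotations as PDAnnotationMarkup or one of its subclasses. The names AnnotationDump and DumpMarkupAnnotations are made up for illustration:

```vbnet
Imports org.apache.pdfbox.pdmodel
Imports org.apache.pdfbox.pdmodel.interactive.annotation

Module AnnotationDump
    ' Hypothetical helper: prints subtype, author and contents of every
    ' markup annotation (text notes, highlights, free text, ...) in the file.
    Sub DumpMarkupAnnotations(ByVal path As String)
        Dim doc As PDDocument = PDDocument.loadNonSeq(New java.io.File(path), Nothing)
        Try
            Dim pages As java.util.List = doc.getDocumentCatalog().getAllPages()
            For i = 0 To pages.size() - 1
                Dim page As PDPage = DirectCast(pages.get(i), PDPage)
                Dim annotations As java.util.List = page.getAnnotations()
                For j = 0 To annotations.size() - 1
                    ' TryCast yields Nothing for non-markup annotations such as links or widgets
                    Dim markup As PDAnnotationMarkup = TryCast(annotations.get(j), PDAnnotationMarkup)
                    If markup IsNot Nothing Then
                        ' getTitlePopup() is the T entry, which viewers conventionally use for the author
                        Console.WriteLine("{0}-{1} [{2}] {3}: {4}", i, j,
                                          markup.getSubtype(), markup.getTitlePopup(), markup.getContents())
                    End If
                Next
            Next
        Finally
            doc.close()
        End Try
    End Sub
End Module
```

Calling DumpMarkupAnnotations with the path of your PDF should list one line per markup annotation. The subtype in brackets shows which annotation kinds the file actually contains, which makes it easier to narrow the filter afterwards.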
Upvotes: 2