Reputation: 1193
I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:
One using PyPDF2:
import pandas as pd
import PyPDF2

src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
df_comments = pd.DataFrame()

for i in range(nPages):
    annotation = []
    page0 = input1.getPage(i)
    try:
        # resolve each indirect reference into its annotation dictionary
        for annot in page0['/Annots']:
            annotation.append(annot.getObject())
        # one page-number entry (1-based) per annotation found on this page
        page = pd.DataFrame([i + 1] * len(annotation))
        annotation = pd.DataFrame(annotation)
        df_temp = pd.concat([page, annotation], axis=1)
        df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
    except KeyError:
        # there are no annotations on this page
        pass
and the other using fitz:
import fitz

doc = fitz.open(src)
for i in range(doc.pageCount):
    page = doc[i]
    for annot in page.annots():
        print(annot.info)
The comments are extracted, but when I compare them against the PDF they do not come out in the order in which they appear in the document. I have looked at other fields such as the creation date and modification date, but those do not help.
Is there a way to extract them in the order in which they appear in the PDF? Alternatively, can I also extract the text in the PDF that each comment is attached to?
Upvotes: 3
Views: 4588
Reputation: 136845
I'm the current maintainer of PyPDF2.
The annotations are currently extracted in the order in which they are stored in each page's /Annots array, which need not match their visual order on the page.
If you have a sensible way to sort them, feel free to open a feature request in the PyPDF2 issue tracker on GitHub.
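One possible ordering, offered only as a sketch and not as anything PyPDF2 does itself, is to sort by page and then by position on the page using each annotation's /Rect entry. The helper names `reading_order_key` and `sort_annotations` are hypothetical, and the sample data stands in for `(page_number, annot.getObject())` pairs. It assumes a single-column layout and the default PDF coordinate system, where y grows upward from the bottom of the page.

```python
from typing import Any, Dict, List, Tuple

Annotation = Tuple[int, Dict[str, Any]]  # (page number, annotation dictionary)


def reading_order_key(item: Annotation) -> Tuple[int, float, float]:
    """Sort key: page first, then top-to-bottom, then left-to-right."""
    page_number, annot = item
    x0, y0, x1, y1 = (float(v) for v in annot["/Rect"])
    top = max(y0, y1)    # the larger y is the upper edge (origin at bottom-left)
    left = min(x0, x1)
    return (page_number, -top, left)  # negate top so higher items sort first


def sort_annotations(items: List[Annotation]) -> List[Annotation]:
    """Return the annotations in approximate reading order."""
    return sorted(items, key=reading_order_key)


# Hypothetical sample data standing in for extracted annotations.
sample = [
    (2, {"/Contents": "c", "/Rect": [50, 700, 150, 720]}),
    (1, {"/Contents": "b", "/Rect": [50, 300, 150, 320]}),
    (1, {"/Contents": "a", "/Rect": [50, 600, 150, 620]}),
]
ordered = [annot["/Contents"] for _, annot in sort_annotations(sample)]
print(ordered)  # ['a', 'b', 'c']
```

With the question's loop, the input could be built by appending `(i + 1, annot.getObject())` for each annotation. Multi-column pages or rotated pages would need a different key.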
Upvotes: 2