Reputation: 154
I have a collection of .pdf files with comments that were added in Adobe Acrobat. I would like to be able to analyze these comments, but I'm kind of stuck on extracting them. I've looked at the pdftools package, but it seems to only be able to extract the text and not the comments. Is there a method available for extracting the comments within R?
Upvotes: 12
Views: 7342
Reputation: 2233
You can call the Python package PyMuPDF from R through reticulate, as follows:
library(reticulate)

# Create a conda environment with PyMuPDF if it does not exist yet
conda_Env <- conda_list()
if(!any(conda_Env[, 1] == "PyMuPDF"))
{
  reticulate::conda_create(envname = "PyMuPDF", python_version = "3.7.16")
  reticulate::conda_install(envname = "PyMuPDF", packages = "PyMuPDF", pip = TRUE)
}
reticulate::use_condaenv(condaenv = "PyMuPDF")

path_To_PDF <- "C:/Annotated text.pdf"
fitz <- import("fitz")
doc <- fitz$open(path_To_PDF)
nb_Page <- doc$page_count
list_Comments <- list()
list_Highlights <- list()
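##########################
#### Highlighted text ####
##########################
for(i in seq_len(nb_Page))
{
  page <- doc[i - 1]
  annots <- page$annots()
  page_Highlights <- character(0)
  # Visit every annotation on the page; iter_next() returns NULL when none are left
  while(!is.null(annot <- iter_next(annots)))
  {
    # Text located under the annotation rectangle
    page_Highlights <- c(page_Highlights, page$get_textbox(annot$rect))
  }
  list_Highlights[[i]] <- page_Highlights
}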
########################
#### Commented text ####
########################
for(i in seq_len(nb_Page))
{
  page <- doc[i - 1]
  annots <- page$annots()
  # Visit every annotation on the page; iter_next() returns NULL when none are left
  while(!is.null(annot <- iter_next(annots)))
  {
    # The comment text is stored under "content" in the annotation info
    list_Comments[[length(list_Comments) + 1]] <- annot$info$content
  }
}
Upvotes: 0
Reputation: 2233
I was able to extract the comments from a PDF file with Ghostscript and R. First, I rewrite the PDF file with Ghostscript. This produces /Contents( markers that are visible when the PDF file is read as a text file. Afterwards, I read the rewritten PDF file as text and extract the comments with a regex.
library(stringr)

# Rewrite the PDF with Ghostscript, keeping the annotations
system2("C:\\Program Files\\gs\\gs10.02.0\\bin\\gswin64c.exe", args = "-sDEVICE=pdfwrite -dDOPDFMARKS -dNOPAUSE -dPreserveAnnots=true -dBATCH -o C:\\Annotated_F.pdf C:\\Annotated.pdf")

# Read the rewritten PDF as text (same path as the -o argument above)
txt <- readLines("C:\\Annotated_F.pdf", warn = FALSE)

# Keep the lines holding a comment, then strip the "/Contents(" prefix and the closing ")"
bool_Txt <- stringr::str_detect(txt, "/Contents\\(")
txt_Comments <- txt[bool_Txt]
txt_Comments <- stringr::str_remove(txt_Comments, "/Contents\\(")
nb_Char <- nchar(txt_Comments)
txt_Comments <- stringr::str_sub(txt_Comments, end = nb_Char - 1)
txt_Comments
[1] "Commentaires 1" "Commentaire 2"
The PDF files used in the example are available here: https://github.com/ManuHamel/R_Examples_PDF_Annotation
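If you would rather do the regex step outside R, the same idea can be sketched in Python. This is only a minimal sketch: it assumes the Ghostscript output stores each comment as an uncompressed literal string of the form /Contents(...), and the names CONTENTS_RE, read_pdf_as_text and extract_comments are hypothetical. Comments stored as hex strings or as UTF-16 text, or containing unescaped nested parentheses, are not handled.

```python
import re

# "/Contents(" followed by the string body; the body may contain
# backslash escapes such as "\)" but no bare parentheses.
CONTENTS_RE = re.compile(r"/Contents\s*\(((?:\\.|[^\\()])*)\)")


def read_pdf_as_text(path):
    """Read a PDF file as text; latin-1 maps every byte to one character."""
    with open(path, "rb") as handle:
        return handle.read().decode("latin-1")


def extract_comments(pdf_text):
    """Return the body of every /Contents(...) literal string found in the text."""
    comments = []
    for match in CONTENTS_RE.finditer(pdf_text):
        raw = match.group(1)
        # Undo the escaping of parentheses and backslashes only
        comments.append(re.sub(r"\\([()\\])", r"\1", raw))
    return comments
```

Calling extract_comments(read_pdf_as_text(path)) on the rewritten file should give the same kind of list as txt_Comments above, provided the assumptions hold for your files.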
Upvotes: 0
Reputation: 1
Export the comments as an Excel file, then import it into R?
E.g. in PDF-XChange Editor, go to Comment > Summarize Comments > Export into whatever format you want. Adobe has a similar option.
Upvotes: 0
Reputation: 99
PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) is the only Python library I have found that works.
Installation in Debian/Ubuntu-based distributions:
apt-get install python3-fitz
Script:
import fitz

doc = fitz.open("example.pdf")
# Iterating over the document yields its pages in order
for page in doc:
    for annot in page.annots():
        print(annot.info["content"])
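Since the goal is to analyze the comments in R, it may be handier to collect them into a table than to print them. Here is a rough sketch of that, assuming the same PyMuPDF objects as in the script above. The helper names collect_comments and write_comments_csv are hypothetical, and the author is read from the "title" entry of annot.info, which by PDF convention holds the comment author.

```python
import csv


def collect_comments(doc):
    """Return one dict per non-empty comment in a PyMuPDF-style document."""
    rows = []
    for page_number, page in enumerate(doc, start=1):
        for annot in page.annots():
            info = annot.info
            content = info.get("content", "")
            if content:  # skip annotations that carry no comment text
                rows.append({
                    "page": page_number,
                    "author": info.get("title", ""),
                    "content": content,
                })
    return rows


def write_comments_csv(rows, csv_path):
    """Write the collected comments to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["page", "author", "content"])
        writer.writeheader()
        writer.writerows(rows)
```

With PyMuPDF installed, this could be used as write_comments_csv(collect_comments(fitz.open("example.pdf")), "comments.csv"), where comments.csv is just an example file name, and the result loaded in R with read.csv("comments.csv").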
Upvotes: 8
Reputation: 1
Did you try PoDoFo or another open-source tool that can access the PDF elements? You can also look at "Extracting PDF annotations/comments" here on Stack Overflow if you are willing to do a little programming.
Upvotes: 0