Reputation: 154

Extract comments from pdf

I have a collection of .pdf files with comments that were added in Adobe Acrobat. I would like to be able to analyze these comments, but I'm kind of stuck on extracting them. I've looked at the pdftools package, but it seems to only be able to extract the text and not the comments. Is there a method available for extracting the comments within R?

Upvotes: 12

Answers (5)

Emmanuel Hamel

Reputation: 2233

You can call the Python package PyMuPDF from R through reticulate, as follows:

library(reticulate)

conda_Env <- conda_list()

if(any(conda_Env[, 1] == "PyMuPDF") == FALSE)
{
  reticulate::conda_create(envname = "PyMuPDF", python_version = "3.7.16")
  reticulate::conda_install(envname = "PyMuPDF", packages = "PyMuPDF", pip = TRUE)
}

reticulate::use_condaenv(condaenv = "PyMuPDF")
path_To_PDF <- "C:/Annotated text.pdf"
fitz <- import("fitz")
doc <- fitz$open(path_To_PDF)
nb_Page <- doc$page_count
list_Comments <- list()
list_Highlights <- list()

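##########################
#### Highlighted text ####
##########################
# Collect the text under every annotation on every page (one flat list).
# iter_next() returns NULL once a page has no more annotations, so pages
# without annotations are simply skipped.
for(i in seq_len(nb_Page))
{
  page <- doc$load_page(i - 1L)  # PyMuPDF pages are 0-indexed
  annots <- page$annots()

  while(!is.null(annot <- iter_next(annots)))
  {
    list_Highlights[[length(list_Highlights) + 1]] <- page$get_textbox(annot$rect)
  }
}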

########################
#### Commented text ####
########################
for(i in seq_len(nb_Page))
{
  page <- doc$load_page(i - 1L)  # PyMuPDF pages are 0-indexed
  annots <- page$annots()

  # iter_next() returns NULL once the page has no more annotations
  while(!is.null(annot <- iter_next(annots)))
  {
    list_Comments[[length(list_Comments) + 1]] <- annot$info$content
  }
}

Upvotes: 0

Emmanuel Hamel

Reputation: 2233

I was able to extract the comments from a PDF file with Ghostscript and R. First, I rewrite the PDF with Ghostscript so that each comment shows up as a /Contents(...) marker when the file is read as plain text. Then I read the rewritten PDF as a text file and extract the comments with a regex.

library(stringr)

gs_Exe <- "C:\\Program Files\\gs\\gs10.02.0\\bin\\gswin64c.exe"
path_In <- "C:\\Annotated.pdf"
path_Out <- "C:\\Annotated_F.pdf"

# Rewrite the PDF with Ghostscript, keeping the annotations
system2(gs_Exe, args = c("-sDEVICE=pdfwrite", "-dDOPDFMARKS", "-dNOPAUSE", "-dPreserveAnnots=true", "-dBATCH", "-o", path_Out, path_In))

# Read the rewritten PDF (the same file Ghostscript just wrote) as plain text
txt <- readLines(path_Out, warn = FALSE)
txt_Comments <- txt[stringr::str_detect(txt, "/Contents\\(")]

# Strip the "/Contents(" prefix and the closing ")"
txt_Comments <- stringr::str_remove(txt_Comments, "/Contents\\(")
txt_Comments <- stringr::str_sub(txt_Comments, end = -2)
txt_Comments

[1] "Commentaires 1" "Commentaire 2"

The PDF files used in the example are available here: https://github.com/ManuHamel/R_Examples_PDF_Annotation
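
Trimming the last character, as above, assumes that each /Contents(...) entry ends its line and that the comment holds no parentheses of its own. As a rough sketch for the other cases, the marker can be read as a PDF literal string with balanced parentheses. The version below is in Python, the language of the PyMuPDF answers; `extract_contents` is a hypothetical helper name, and the sketch assumes the comments are stored as plain literal strings (not hex or UTF-16 strings).

```python
import re

# Matches the start of a /Contents literal string, e.g. "/Contents(".
_CONTENTS = re.compile(r"/Contents\s*\(")


def extract_contents(text):
    """Return the literal string that follows each /Contents( marker."""
    results = []
    for match in _CONTENTS.finditer(text):
        i = match.end()
        depth = 1  # we are already inside the opening parenthesis
        chars = []
        while i < len(text):
            ch = text[i]
            if ch == "\\" and i + 1 < len(text):
                nxt = text[i + 1]
                # Unescape \( \) and \\ ; other escapes (such as \n or
                # octal codes) are left undecoded in this sketch.
                chars.append(nxt if nxt in "()\\" else ch + nxt)
                i += 2
                continue
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    break  # closing parenthesis of the whole string
            chars.append(ch)
            i += 1
        if depth == 0:  # skip strings that never close
            results.append("".join(chars))
    return results
```

Called on the full text of the rewritten file (for example the lines joined with newlines), this returns one string per marker, even when further keys such as /Rect follow on the same line.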

Upvotes: 0

Rastameman

Reputation: 1

Export the comments as an Excel file, then import it into R?

E.g., in PDF-XChange Editor, go to Comment > Summarize Comments and export to whatever format you want. Adobe Acrobat has a similar option.

Upvotes: 0

Bernuly

Reputation: 99

PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) is the only Python library I have found that works.

Installation in Debian/Ubuntu-based distributions:

apt-get install python3-fitz

Script:

import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# Iterating the document directly avoids doc.pageCount, which newer
# PyMuPDF releases renamed to doc.page_count.
for page in doc:
    for annot in page.annots():
        print(annot.info["content"])
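
To analyze the comments in R afterwards, one option is to write them to a CSV file and load it with read.csv. The following is a minimal sketch, not a definitive implementation: `collect_comments` and `write_csv` are hypothetical helper names, and the code only assumes that iterating the document yields pages, that `page.annots()` yields annotations, that `annot.type` is a tuple whose second item is the type name, and that `annot.info` is a dict with a "content" key.

```python
import csv


def collect_comments(doc):
    """Return one dict per annotation: 1-based page number, type, content."""
    rows = []
    for page_number, page in enumerate(doc, start=1):
        for annot in page.annots():
            rows.append({
                "page": page_number,
                "type": annot.type[1],  # e.g. "Highlight" or "Text"
                "content": annot.info.get("content", ""),
            })
    return rows


def write_csv(rows, handle):
    """Write the collected rows as CSV to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=["page", "type", "content"])
    writer.writeheader()
    writer.writerows(rows)
```

With PyMuPDF installed, passing `fitz.open("example.pdf")` to `collect_comments` and a file opened with `newline=""` to `write_csv` should give a table with one row per annotation, which keeps the page number next to each comment.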

Upvotes: 8

PDFix

Reputation: 1

Have you tried PoDoFo or another open-source tool that can access the PDF elements? If you are willing to do a little programming, you can also look at "Extracting PDF annotations/comments" here on Stack Overflow.

Upvotes: 0
