Lazarus Thurston

Reputation: 1287

how to extract bold text from a pdf using R

I have searched through SO, and the closest I got to an answer was here. But I need a simpler and more elegant way to extract bold text from a simple paragraph of text in a PDF. The pdftools package only extracts the plain-text component. Does anyone know of another way to detect bold tokens (or words) in a chunk of text in a PDF? I use R, so kindly restrict suggestions to R.

Upvotes: 1

Views: 3325

Answers (4)

venrey

Reputation: 185

JUNE 2021 UPDATED ANSWER

I think this question needs an updated answer.

GOOD NEWS: A recent update of the R package pdftools added the option to extract font data from PDFs. The function pdf_data now has an additional argument, font_info, described in the documentation:

font_info if TRUE, extract font-data for each box. Be careful, this requires a very recent version of poppler and will error otherwise.

A simple implementation using pdftools::pdf_data with font_info=TRUE shows that:

pdftools::pdf_data(pdf = "https://arxiv.org/pdf/2012.10582.pdf", font_info = TRUE)

REMARKS:

  1. Some bold fonts are indicated by suffixes of medium fonts (Medi) as in e.g., KBJWLM+NimbusRomNo9L-Medi.
  2. The italics are indicated by suffixes like ReguItal which stands for 'Regular Italics'. For example, ZBSJXS+NimbusRomNo9L-ReguItal.
  3. Regular fonts are obviously indicated by purely Regu suffixes as in VDTZKA+NimbusRomNo9L-Regu.
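
The suffixes listed above can be turned into a rough filter. The following is a minimal sketch, not part of pdftools: `is_bold_font` and `bold_tokens` are hypothetical helper names, the list of weight tags is a guess that will not cover every font family, and the code assumes each page's data frame has `text` and `font_name` columns when `font_info = TRUE` is used. A mock page stands in for real `pdf_data()` output so the example runs without a PDF.

```r
# Heuristic: treat a font as bold when its name carries a bold-like weight tag.
# The tag list is an assumption and will miss some font families.
is_bold_font <- function(font_name) {
  # drop the six-letter subset prefix, e.g. "KBJWLM+"
  base <- sub("^[A-Z]{6}\\+", "", font_name)
  grepl("bold|medi|demi|heavy|black", base, ignore.case = TRUE)
}

# Keep only the bold tokens of one page's data frame.
# Assumes `text` and `font_name` columns are present.
bold_tokens <- function(page) {
  page$text[is_bold_font(page$font_name)]
}

# Mock page standing in for one element of the pdf_data() result
page <- data.frame(
  text = c("Abstract", "We", "propose", "Theorem"),
  font_name = c("KBJWLM+NimbusRomNo9L-Medi",
                "VDTZKA+NimbusRomNo9L-Regu",
                "ZBSJXS+NimbusRomNo9L-ReguItal",
                "KBJWLM+NimbusRomNo9L-Medi"),
  stringsAsFactors = FALSE
)

bold_tokens(page)
#> [1] "Abstract" "Theorem"
```

With a real document, the same helper could be applied per page, for example `lapply(pdftools::pdf_data("file.pdf", font_info = TRUE), bold_tokens)`, where "file.pdf" is a placeholder path.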

WARNING: This answer has not been tested on scanned (image) PDFs with overlaid text.

Upvotes: 3

Lazarus Thurston

Reputation: 1287

This answer is based on the answers from @hrbrmstr and @ralf, so thanks to them. I've made the answers simpler (mainly by taking out the peculiarities of the HTML conversion and file naming). It is also tailored to macOS users (and perhaps Linux); I'm not sure about Windows.

I presume you have pdftohtml installed on your machine. If not, use brew install pdftohtml. If you do not have Homebrew on your Mac, install it first. A link is provided to help you with Homebrew.

Once you are sure pdftohtml is installed on the Mac, use this R function to extract bold text from any PDF document.

library(magrittr)
library(rvest)
library(stringr)

# pass a pdf file in current directory to this function
extr_bold <- function(file) {
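  basefile <- str_remove(file, "\\.(pdf|PDF)$")
  # pdftohtml writes the page content to <basefile>s.html
  htmlfile <- paste0(basefile, "s", ".html")
  # file.exists() checks the disk; exists() would look for an R object
  if (!file.exists(htmlfile))
    system2("pdftohtml", args = c("-i", file), stdout = NULL)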
  nodevar <- read_html(htmlfile)
  x <- html_nodes(nodevar,xpath = ".//b")
  html_text(x)
}

Upvotes: 1

hrbrmstr

Reputation: 78832

Along with having a flexible toolkit, data science regularly requires out-of-the-box thinking (at least in my profession).

But, first, a thing about PDF files.

I don't think they are what you think they are. "Bold" (or "italic", etc.) isn't "metadata". You should spend some time reading up on PDF files because they are complex, nasty, evil things that you are likely to encounter often when working with data. Read this — https://stackoverflow.com/a/19777953/1457051 — to see what finding bold text actually entails (follow the link to the 1.8.x Java pdfbox solution).

Back to our irregularly scheduled answering

While I'm one of the YUGEst proponents of R, not everything needs to be done or should be done in R. Sure, we'll use R to eventually get your bold text but we'll use a helper command-line utility to do so.

The pdftools package is based on the poppler library. It comes with the source so "I'm just an R user" folks likely don't have the full poppler toolset on their system.

Mac folks can install it with Homebrew (once you have Homebrew set up):

  • brew install poppler

Linux folks know how to do things. Windows folks are lost forever (there are poppler binaries for you, but your time would be better spent switching to a real operating system).

Once you do that, you can use the below to achieve your goal.

First, we'll make a helper function with lots of safety bumpers:

#' Uses the command-line pdftohtml function from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#'        They should be supplied as you would supply arguments to the `args`
#'        parameter of [system2()].
read_pdf_as_html <- function(path, extra_args=character()) {

  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call.=FALSE)
  }

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)
  on.exit(unlink(td, recursive=TRUE), add=TRUE)

  # save our current working directory
  curwd <- getwd()
  on.exit(setwd(curwd), add=TRUE)

  # copy the PDF first (a relative path would break after setwd()),
  # then move to the temp space
  file.copy(path, td)
  setwd(td)

  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args

  args <- c(args, extra_args, basename(path), "r-doc") # output base name; page content lands in r-docs.html

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res

  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")

  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")

}

Now, we'll use it:

doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")

bold_tags <- html_nodes(doc, xpath=".//b")

bold_words <- html_text(bold_tags)

head(bold_words, 20)
##  [1] "Preamble"                                                                                   
##  [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
##  [3] "History"                                                                                    
##  [4] "Ancient and Medieval Period"                                                                
##  [5] "The Introduction of English Law Into India"                                                 
##  [6] "Mofussal Courts"                                                                            
##  [7] "Legislation"                                                                                
##  [8] "The Indian Contract Act 1872"                                                               
##  [9] "The Making of the Act"                                                                      
## [10] "Law of Contract Until 1950"                                                                 
## [11] "The Law of Contract after 1950"                                                             
## [12] "Amendments to This Act"                                                                     
## [13] "Other Laws Affecting Contracts and Enforcement"                                             
## [14] "Recommendations of the Indian Law Commission"                                               
## [15] "Section 1."                                                                                 
## [16] "Short title"                                                                                
## [17] "Extent, Commencement."                                                                      
## [18] "Enactments Repealed."                                                                       
## [19] "Applicability of the Act"                                                                   
## [20] "Scheme of the Act"

length(bold_words)
## [1] 1939

No Java required at all and you've got your bold words.
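
If individual bold tokens are wanted rather than whole bold phrases, the character vector can be split further. This is a minimal base-R sketch; `split_bold_words` is a hypothetical helper name, and lower-casing plus de-duplicating are design choices that may not suit every use case.

```r
# Split bold phrases into individual word tokens (lower-cased, de-duplicated)
split_bold_words <- function(phrases) {
  words <- unlist(strsplit(trimws(phrases), "[[:space:]]+"))
  # strip leading/trailing punctuation such as "," ";" "."
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  unique(tolower(words[nzchar(words)]))
}

split_bold_words(c("Preamble", "Section 1.", "Extent, Commencement."))
#> [1] "preamble"     "section"      "1"            "extent"       "commencement"
```

In the session above, `split_bold_words(bold_words)` would be the corresponding call.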

If you do want to go the pdfbox-app route as Ralf noted, you can use this wrapper to make it easier to work with:

read_pdf_as_html_with_pdfbox <- function(path) {

  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call.=FALSE)
  }

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # download the pdfbox "app" if not installed
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    httr::GET(
      url = "https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }

  # write the converted HTML to a temp file
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add=TRUE)

  c(
    "-jar",
    path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
    "ExtractText",
    "-html",
    path,
    tf
  ) -> args

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  system2(
    command = java,
    args = args
  ) -> res

  xml2::read_html(tf)

}

Upvotes: 3

Ralf Stubner

Reputation: 26833

You don't have to use tabularizer, but I don't know a way that does not involve Java. I had hoped that Apache Tika via the rtika package could be used, but bold text is not rendered as such. However, one can use pdfbox as shown in that ticket:

 java -jar <pdfbox-jar> ExtractText -html <pdf-file> <html-file>

This command would normally be started in a shell, but you can also use system() or system2() from within R. Then in R use

html <- xml2::read_html(<html-file>)
bold <- xml2::xml_find_all(html, '//b')
head(xml2::xml_contents(bold))

to process the HTML file. With your document this returns

{xml_nodeset (6)}
[1] Preamble\n
[2] WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;\n
[3] History\n
[4] Ancient and Medieval Period\n
[5] The Introduction of English Law Into India\n
[6] Mofussal Courts\n
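
The node set above still carries trailing newlines. As a small sketch, assuming the xml2 package is installed, `xml2::xml_text()` with `trim = TRUE` returns a plain character vector without them. The HTML string below is a made-up stand-in for the pdfbox output, not the real file.

```r
library(xml2)

# Tiny stand-in for the HTML that pdfbox writes (hypothetical content)
html <- read_html(
  "<html><body><p><b>Preamble\n</b>plain text <b>History\n</b></p></body></html>"
)

bold <- xml_find_all(html, "//b")

# xml_text() returns a character vector; trim = TRUE strips the trailing "\n"
bold_text <- xml_text(bold, trim = TRUE)
bold_text
#> [1] "Preamble" "History"
```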

Upvotes: 2
