emeryville

Reputation: 372

How to extract bold and non-bold text from a PDF using R

I am using R to extract text. The code below works well for extracting the non-bold text from the PDF, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
library(pdftools)
library(tesseract)
library(tiff)

info <- pdf_info(news)
numberOfPageInPdf <- info$pages   # pdf_info() reports the page count directly
numberOfPageInPdf

# render each page as a 300 dpi bitmap, write it to a TIFF,
# then OCR the image and save the recognised text
for (i in 1:numberOfPageInPdf) {
  bitmap <- pdf_render_page(news, page = i, dpi = 300, numeric = TRUE)
  file_name <- paste0("page", i, ".tiff")
  file_tiff <- tiff::writeTIFF(bitmap, file_name)
  out <- ocr(file_name)
  file_txt <- paste0("text", i, ".txt")
  writeLines(out, file_txt)
}

Upvotes: 3

Views: 1338

Answers (2)

Ralf Stubner

Reputation: 26833

There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text() can read this file directly:

stringi::stri_split(pdftools::pdf_text(news), regex = "\n")
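A quick check (a minimal sketch; news is the URL from the question) should show the bold entries alongside everything else:

txt <- pdftools::pdf_text(news)                  # one character string per page
pages <- stringi::stri_split(txt, regex = "\n")  # split each page into lines
head(pages[[1]])                                 # first lines of page 1, bold text included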

Upvotes: 1

Mako212

Reputation: 7312

I like using the tabulizer library for this. Here's a small example:

devtools::install_github("ropensci/tabulizer")
library(tabulizer)

news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly

page1 <- extract_tables(news, guess = TRUE, pages = 1, encoding = 'UTF-8')

page1[[1]]

      [,1] [,2]                    [,3]       [,4]                [,5]    [,6]                [,7]      
 [1,] ""   "Division: 1"           ""         ""                  ""      ""                  "Série: A"
 [2,] ""   "514"                   ""         "Fontaine 1 KBSK 1" ""      ""                  "303"     
 [3,] "1"  "62529 WIRIG ANTHONY"   ""         "2501 1⁄2-1⁄2"      "51560" "CZEBE ATTILLA"     "2439"    
 [4,] "2"  "62359 BRUNNER NICOLAS" ""         "2443 0-1"          "51861" "PICEU TOM"         "2401"    
 [5,] "3"  "75655 CEKRO EKREM"     ""         "2393 0-1"          "10391" "GEIRNAERT STEVEN"  "2400"    
 [6,] "4"  "50211 MARECHAL ANDY"   ""         "2355 0-1"          "35181" "LEENHOUTS KOEN"    "2388"    
 [7,] "5"  "73059 CLAESEN PIETER"  ""         "2327 1⁄2-1⁄2"      "25615" "DECOSTER FREDERIC" "2373"    
 [8,] "6"  "63614 HOURIEZ CLEMENT" ""         "2304 1⁄2-1⁄2"      "44954" "MAENHOUT THIBAUT"  "2372"    
 [9,] "7"  "60369 CAPONE NICOLA"   ""         "2283 1⁄2-1⁄2"      "10430" "VERLINDE TIEME"    "2271"    
[10,] "8"  "70653 LE QUANG KIM"    ""         "2282 0-1"          "44636" "GRYSON WOUTER"     "2269"    
[11,] ""   ""                      "< 2361 >" "12 - 20"           ""      "< 2364 >"          ""      

You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
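Something along these lines works around that (a sketch; the temp-file path is just an example):

# locate_areas() needs a local file, so download the PDF first
local_pdf <- file.path(tempdir(), "ind01.pdf")
download.file(news, local_pdf, mode = "wb")

# opens an interactive widget: drag a rectangle over the table you want,
# and locate_areas() returns its coordinates as c(top, left, bottom, right)
area1 <- locate_areas(local_pdf, pages = 1)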

You'll note that each table is its own element in the returned list.
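For instance:

length(page1)   # how many tables tabulizer detected on page 1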

Here's an example using a custom region to just select the first table on each page:

# area coordinates are given as c(top, left, bottom, right)
customArea <- extract_tables(news, guess = FALSE, pages = 1, area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')

This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract, because you're not relying on OCR to translate a pixel arrangement back into text. In a digital PDF, each text element has an x and y position, and the tabulizer library uses that information to detect tables heuristically and extract sensibly formatted data. You'll see you still have some cleanup to do, but it's pretty manageable.
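You can inspect those coordinates yourself with pdftools::pdf_data() (a quick illustration, separate from the tabulizer workflow; requires pdftools >= 2.0):

library(pdftools)
# one data frame per page: each word with its width, height, and x/y position
words <- pdf_data(news)[[1]]
head(words)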

Edit: just for fun, here's a little example of starting the cleanup with data.table

library(data.table)

cleanUp <- setDT(as.data.frame(page1[[1]]))

# pull Division and Series out of the header row, split the player ID out of V2,
# drop the empty columns, then remove the header row itself
cleanUp[ ,  `:=` (Division = as.numeric(gsub("^.*?(\\d{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
  Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
  ][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
  ][, c("V1", "V3") := NULL
  ][-grep('Division', V2, fixed=TRUE)]

Here we've moved Division, Series, and ID into their own columns and removed the Division header row. This is just the general idea, and it would need a little refinement to apply to all 27 pages; a sketch of that follows the output below.

                       V2                V4    V5                V6   V7 Division Series    ID
 1:                   514 Fontaine 1 KBSK 1                          303        1      A   514
 2:   62529 WIRIG ANTHONY      2501 1/2-1/2 51560     CZEBE ATTILLA 2439        1      A 62529
 3: 62359 BRUNNER NICOLAS          2443 0-1 51861         PICEU TOM 2401        1      A 62359
 4:     75655 CEKRO EKREM          2393 0-1 10391  GEIRNAERT STEVEN 2400        1      A 75655
 5:   50211 MARECHAL ANDY          2355 0-1 35181    LEENHOUTS KOEN 2388        1      A 50211
 6:  73059 CLAESEN PIETER      2327 1/2-1/2 25615 DECOSTER FREDERIC 2373        1      A 73059
 7: 63614 HOURIEZ CLEMENT      2304 1/2-1/2 44954  MAENHOUT THIBAUT 2372        1      A 63614
 8:   60369 CAPONE NICOLA      2283 1/2-1/2 10430    VERLINDE TIEME 2271        1      A 60369
 9:    70653 LE QUANG KIM          2282 0-1 44636     GRYSON WOUTER 2269        1      A 70653
10:                                 12 - 20                < 2364 >             1      A    NA
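Extending to the whole document could look something like this (a sketch, assuming every page follows the same layout; extract_tables() reads all pages when pages is omitted):

# extract every page, coerce each table to a data frame, and stack them
allPages <- extract_tables(news, guess = TRUE, encoding = 'UTF-8')
combined <- rbindlist(lapply(allPages, as.data.frame), fill = TRUE)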

Upvotes: 2
