Reputation: 372
I am using R for extracting text. The code below works well to extract the non-bold text from pdf but it ignores the bold part. Is there a way to extract both bold and non-bold text?
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
library(pdftools)
library(tesseract)
library(tiff)
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf
for (i in 1:numberOfPageInPdf){
bitmap <- pdf_render_page(news, page=i, dpi = 300, numeric = TRUE)
file_name <- paste0("page", i, ".tiff")
file_tiff <- tiff::writeTIFF(bitmap, file_name)
out <- ocr(file_name)
file_txt <- paste0("text", i, ".txt")
writeLines(out, file_txt)
}
Upvotes: 3
Views: 1338
Reputation: 26833
There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text()
can read this file directly:
stringi::stri_split(pdf_text(news), regex = "\n")
Upvotes: 1
Reputation: 7312
I like using the tabulizer
library for this. Here's a small example:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly
page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
You can also use the locate_areas
function to specify a specific region if you only care about some of the tables. Note that for locate_areas
to work, I had to download the file locally first; using the URL returned an error.
You'll note that each table is its own element in the returned list.
Here's an example using a custom region to just select the first table on each page:
customArea <- extract_tables(news, guess=FALSE, page = 1, area=list(c(84,27,232,569), encoding = 'UTF-8')
This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract
beacuse you're not relying on the OCR library to translate pixel arrangement back into text. In digital PDFs, each text element has an x and y position, and the tabulizer
library uses that information to detect table heuristics and extract sensibly formatted data. You'll see you still have some clean up to do, but it's pretty manageable.
Edit: just for fun, here's a little example of starting the clean up with data.table
library(data.table)
cleanUp <- setDT(as.data.frame(page1[[1]]))
cleanUp[ , `:=` (Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
][, c("V1", "V3") := NULL
][-grep('Division', V2, fixed=TRUE)]
Here we've moved Division
, Series
, and ID
into their own columns, and removed the Division
header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
Upvotes: 2