Reputation: 167
I'm trying to parse some historic crude oil price data using tabulizer and running into what appear to be encoding errors. Below is a reproducible example with one of the files I want to scrape.
library(tidyverse)
library(tabulizer)
library(pdftools)
#example file
file <- "https://paalp.s3.amazonaws.com/plains/media/bulletins/paa/monthly/2000/December%202000.pdf"
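#download file to the local path used by extract_text() below
dir.create("Files", showWarnings = FALSE)
download.file(file, "Files/test.pdf", mode = "wb")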
#check file info - nothing on encoding per se
pdf_info(file)
pdf_info() returns nothing unusual, except that the file is relatively old. I've tried extracting the page both as tables and as text. The errors differ between the two, but the text call seems to reveal where the problem might be.
#attempt to parse table from page 1
pricing_tables <- extract_tables(file, pages = 1)
#grab as text
pricing_text <- extract_text("Files/test.pdf", pages = 1)
When I grab the page as a table, the first columns of pricing are not delimited, extra decimals are inserted, and so on.
When I grab it as text, I can see stray characters in the output (dropped decimal points, ellipsis characters in the dot leaders) that seem to throw the parsing off.
West Texas Intermediate - Area #1...................................................…......................................................…30.75 * 28 75 * 28 00 * 26.25 * 26.75 * 26.25 *\r\n
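As a stopgap, the prices in a line like the one above can be pulled out with a regular expression that tolerates a space where the decimal point was dropped. This is only a sketch in base R: it assumes every price has two digits on each side of the separator and is followed by " *", which holds for this sample line but has not been checked against the whole file. The dot leaders in the sample are shortened.

```r
# Sample line from the extract_text() output (dot leaders shortened)
line <- "West Texas Intermediate - Area #1......30.75 * 28 75 * 28 00 * 26.25 * 26.75 * 26.25 *\r\n"

# Match two digits, then a period or a space, then two digits,
# but only when the token is followed by " *"
pattern <- "[0-9]{2}[. ][0-9]{2}(?= \\*)"
tokens <- regmatches(line, gregexpr(pattern, line, perl = TRUE))[[1]]

# Restore the missing decimal point and convert to numeric
prices <- as.numeric(sub(" ", ".", tokens, fixed = TRUE))
prices
```

This repairs the symptom only. It would silently misread any value that does not fit the two-digit assumption, so it is worth spot-checking the result against the rendered page.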
A couple of other things I tested. First, I can render the PDF into a bitmap and the content transfers appropriately. I'm not sure whether that is informative at all.
# render into raw bitmap
bitmap <- pdf_render_page(file)
png::writePNG(bitmap, "test.png")
Next, the same errors are reproduced if I copy from the PDF into Excel.
Any thoughts or help much appreciated.
Andrew
Upvotes: 1
Views: 355
Reputation: 2213
I have been able to extract the tables with the following code:
library(RDCOMClient)

path_PDF <- "C:\\December%202000.pdf"
path_Word <- "C:\\temp.docx"

#############################################################################
#### Step 1 : Open the PDF in Word, which converts it to a Word document ####
#############################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)

############################################################
#### Step 2 : Extract the tables from the Word document ####
############################################################
nb_Table <- doc$tables()$count()
list_Table <- list()

for(l in 1 : nb_Table)
{
  nb_Row <- doc$tables(l)$Rows()$Count()
  nb_Col <- doc$tables(l)$Columns()$Count()
  mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

  for(i in 1 : nb_Row)
  {
    for(j in 1 : nb_Col)
    {
      # Cells that do not exist (e.g. merged cells) raise an error; store NA instead
      mat_Temp[i, j] <- tryCatch(doc$tables(l)$cell(i, j)$range()$text(), error = function(e) NA)
    }
  }

  list_Table[[l]] <- mat_Temp
}

list_Table
[[1]]
[,1]
[1,] "\tBULLETIN NO:\t2000-199\r\a"
[2,] "\tEFFECTIVE DATE:\t 11-30-00\r\a"
[3,] " $/BBL\rTEXAS\r\a"
[4,] " West Texas Intermediate - Area #1..............................\t30.75 *\r\a"
[5,] " West Texas Intermediate - All Other Areas..................\t31.00 *\r\a"
[6,] " West Texas Sour.........................................................\t28.00 *\r\a"
[7,] " West Central Texas......................................................\t31.00 *\r\a"
[8,] " North Texas Sweet.......................................................\t31.00 *\r\a"
[9,] " North Texas (Cooke & Grayson) Sweet........................\t31.00 *\r\a"
[10,] " North Texas (Cooke & Grayson) Sour..........................\t29.90 *\r\a"
[11,] " East Texas Field..........................................................\t30.75 *\r\a"
[12,] " East Texas (Other than E. TX. Field)............................\t30.75 *\r\a"
[13,] " East Texas Sour...........................................................\t24.50 *\r\a"
[14,] " East Texas Light Sour..................................................\t28.75 *\r\a"
[15,] " Texas Upper Gulf Coast...............................................\t30.00 *\r\a"
[16,] " Texas Central Gulf Coast (Giddings).............................\t30.00 *\r\a"
[17,] " South Texas Light (Sweet)............................................\t30.00 *\r\a"
[18,] " Big Wells Intermediate..................................................\t29.75 *\r\a"
[19,] " South Texas Heavy.......................................................\t30.00 *\r\a"
[20,] " South Texas Sour.........................................................\t27.25 *\r\a"
[21,] "NEW MEXICO\t \r\a"
[22,] " New Mexico Intermediate..............................................\t31.00 *\r\a"
[23,] " New Mexico Sour.........................................................\t28.00 *\r\a"
[24,] "LOUISIANA\t \r\a"
[25,] " North Louisiana............................................................\t30.75 *\r\a"
[26,] " Central Louisiana Sweet.........................................................................\r\a"
[27,] " South Louisiana Light Sweet (Onshore)........................\t30.75 *\r\a"
[28,] " Ferriday Area....................................\r\a"
[29,] " South Louisiana Eugene Island (Onshore)....................\r\a"
[30,] "COLORADO\t \r\a"
[31,] " Colorado Southeastern.................................................\r\a"
[32,] " Colorado Eastern..........................................................\r\a"
[33,] " Colorado Denver Area...................................................\r\a"
[34,] " Colorado Western.........................................................\r\a"
[35,] "ILLINOIS\t \r\a"
[36,] " Illinois Sweet.................................................................\r\a"
[37,] "INDIANA\t \r\a"
[38,] " Indiana Sweet...............................................................\r\a"
[39,] "KANSAS\t \r\a"
[40,] " Kansas Common..........................................................\r\a"
[41,] " Eastern Kansas Common.............................................\r\a"
[42,] " Northwest Kansas Sweet..............................................\r\a"
[43,] " Southwest Kansas Sweet.............................................\r\a"
[44,] "KENTUCKY\t \r\a"
[45,] " Western Kentucky Sweet.............................................\r\a"
[46,] "MISSISSIPPI\r\a"
[47,] " Mississippi Light Sweet & Sour.....................................\r\a"
[48,] "NEBRASKA\t \r\a"
[49,] " Nebraska Western........................................................\r\a"
[50,] "NORTH DAKOTA\r\a"
[51,] " Williston Basin Sweet...................................................\r\a"
[52,] " Williston Basin Sour......................................................\r\a"
[53,] "OKLAHOMA\t \r\a"
[54,] " Domestic Sweet (Cushing)..........................…\r\a"
[55,] " Oklahoma Sweet..........................................................\r\a"
[56,] " Oklahoma Sweet-Central..............................................\r\a"
[57,] " Western Oklahoma (effective Nov. 14, 2000)..................\r\a"
[58,] " Oklahoma Panhandle (effective Nov. 14, 2000) ............\r\a"
[59,] "WYOMING\r\a"
[60,] " Wyoming Southwestern Area.......................................\r\a"
[61,] " Wyoming Southeastern Area........................................\r\a"
[62,] " Wyoming Sweet Other Areas.......................................\r\a"
[63,] " Wyoming General Sour................................................\r\a"
[64,] " Wyoming Asphaltic Sour.............................................. 27.00 * 1 The gravity adjustment scale and other terms are set forth on Page 5.\r\a"
[,2] [,3] [,4] [,5]
[1,] "2000-200\r\a" "2000-201\r\a" "2000-202\r\a" "2000-203\r\a"
[2,] " 12-01-00\r\a" " 12-04-00\r\a" " 12-05-00\r\a" " 12-06-00\r\a"
[3,] " $/BBL\r\a" " $/BBL\r\a" " $/BBL\r\a" " $/BBL\r\a"
[4,] "\r\a" "28.75 *\r\a" "\r\a" "28.00 *\r\a"
[5,] "\r\a" "29.00 *\r\a" "\r\a" "28.25 *\r\a"
[6,] "\r\a" "26.00 *\r\a" "\r\a" "25.25 *\r\a"
[7,] "\r\a" "29.00 *\r\a" "\r\a" "28.25 *\r\a"
[8,] "\r\a" "29.00 *\r\a" "\r\a" "28.25 *\r\a"
[9,] "\r\a" "29.00 *\r\a" "\r\a" "28.25 *\r\a"
[10,] "\r\a" "27.90 *\r\a" "\r\a" "27.15 *\r\a"
[11,] "\r\a" "28.75 *\r\a" "\r\a" "28.00 *\r\a"
[12,] "\r\a" "28.75 *\r\a" "\r\a" "28.00 *\r\a"
[13,] "\r\a" "22.50 *\r\a" "\r\a" "21.75 *\r\a"
[14,] "\r\a" "26.75 *\r\a" "\r\a" "26.00 *\r\a"
[15,] "\r\a" "28.00 *\r\a" "\r\a" "27.00 *\r\a"
[16,] "\r\a" "28.00 *\r\a" "\r\a" "27.00 *\r\a"
[17,] "\r\a" "28.00 *\r\a" "\r\a" "27.25 *\r\a"
[18,] "\r\a" "27.75 *\r\a" "\r\a" "27.00 *\r\a"
[19,] "\r\a" "28.00 *\r\a" "\r\a" "27.25 *\r\a"
[20,] "\r\a" "25.25 *\r\a" "\r\a" "24.50 *\r\a"
[21,] " \r\a" "\r\a" " \r\a" "\r\a"
[22,] "\r\a" "29.00 *\r\a" "\r\a" "28.25 *\r\a"
[23,] "\r\a" "26.00 *\r\a" "\r\a" "25.25 *\r\a"
[24,] " \r\a" "\r\a" " \r\a" "\r\a"
[25,] "28.75 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[26,] "28.75 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[27,] "28.75 *\r\a" "\r\a" "27.75 *\r\a" "\r\a"
[28,] "31.75 *\r\a" "\r\a" "29.75 *\r\a" "\r\a"
[29,] "28.25 *\r\a" "\r\a" "26.25 *\r\a" "\r\a"
[30,] "\r\a" " \r\a" "\r\a" " \r\a"
[31,] "29.85 *\r\a" "\r\a" "27.85 *\r\a" "\r\a"
[32,] "29.50 *\r\a" "\r\a" "27.50 *\r\a" "\r\a"
[33,] "30.00 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[34,] "33.50 *\r\a" "\r\a" "31.50 *\r\a" "\r\a"
[35,] "\r\a" " \r\a" "\r\a" " \r\a"
[36,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[37,] "\r\a" " \r\a" "\r\a" " \r\a"
[38,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[39,] "\r\a" " \r\a" "\r\a" " \r\a"
[40,] "30.25 *\r\a" "\r\a" "28.25 *\r\a" "\r\a"
[41,] "30.25 *\r\a" "\r\a" "28.25 *\r\a" "\r\a"
[42,] "30.00 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[43,] "30.00 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[44,] "\r\a" " \r\a" "\r\a" " \r\a"
[45,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[46,] "\r\a" " \r\a" "\r\a" " \r\a"
[47,] "30.75 *\r\a" "\r\a" "28.75 *\r\a" "\r\a"
[48,] "\r\a" " \r\a" "\r\a" " \r\a"
[49,] "29.50 *\r\a" "\r\a" "27.50 *\r\a" "\r\a"
[50,] "\r\a" " \r\a" "\r\a" " \r\a"
[51,] "29.30 *\r\a" "\r\a" "27.30 *\r\a" "\r\a"
[52,] "26.65 *\r\a" "\r\a" "24.65 *\r\a" "\r\a"
[53,] "\r\a" " \r\a" "\r\a" " \r\a"
[54,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[55,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[56,] "31.00 *\r\a" "\r\a" "29.00 *\r\a" "\r\a"
[57,] "30.50 *\r\a" "\r\a" "28.50 *\r\a" "\r\a"
[58,] "30.50 *\r\a" "\r\a" "28.50 *\r\a" "\r\a"
[59,] "\r\a" " \r\a" "\r\a" " \r\a"
[60,] "31.50 *\r\a" "\r\a" "29.50 *\r\a" "\r\a"
[61,] "30.00 *\r\a" "\r\a" "28.00 *\r\a" "\r\a"
[62,] "30.50 *\r\a" "\r\a" "28.50 *\r\a" "\r\a"
[63,] "27.00 *\r\a" "\r\a" "25.00 *\r\a" "\r\a"
[64,] "\r\a" "25.00 *\r\a" "\r\a" "24.25 *\r\a"
[,6] [,7] [,8] [,9] [,10] [,11]
[1,] "2000-204\r\a" NA NA NA NA NA
[2,] " 12-07-00\r\a" NA NA NA NA NA
[3,] " $/BBL\r\a" NA NA NA NA NA
[4,] "\r\a" "26.25 *\r\a" "\r\a" "26.75 *\r\a" "\r\a" "26.25 *\r\a"
[5,] "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "26.50 *\r\a"
[6,] "\r\a" "23.50 *\r\a" "\r\a" "24.00 *\r\a" "\r\a" "23.50 *\r\a"
[7,] "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "26.50 *\r\a"
[8,] "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "26.50 *\r\a"
[9,] "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "26.50 *\r\a"
[10,] "\r\a" "25.40 *\r\a" "\r\a" "25.90 *\r\a" "\r\a" "25.40 *\r\a"
[11,] "\r\a" "26.25 *\r\a" "\r\a" "26.75 *\r\a" "\r\a" "26.25 *\r\a"
[12,] "\r\a" "26.25 *\r\a" "\r\a" "26.75 *\r\a" "\r\a" "26.25 *\r\a"
[13,] "\r\a" "20.00 *\r\a" "\r\a" "20.50 *\r\a" "\r\a" "20.00 *\r\a"
[14,] "\r\a" "24.25 *\r\a" "\r\a" "24.75 *\r\a" "\r\a" "24.25 *\r\a"
[15,] "\r\a" "25.25 *\r\a" "\r\a" "25.75 *\r\a" "\r\a" "25.25 *\r\a"
[16,] "\r\a" "25.25 *\r\a" "\r\a" "25.75 *\r\a" "\r\a" "25.25 *\r\a"
[17,] "\r\a" "25.50 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "25.50 *\r\a"
[18,] "\r\a" "25.25 *\r\a" "\r\a" "25.75 *\r\a" "\r\a" "25.25 *\r\a"
[19,] "\r\a" "25.50 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "25.50 *\r\a"
[20,] "\r\a" "22.75 *\r\a" "\r\a" "23.25 *\r\a" "\r\a" "22.75 *\r\a"
[21,] " \r\a" "\r\a" " \r\a" "\r\a" " \r\a" "\r\a"
[22,] "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "26.50 *\r\a"
[23,] "\r\a" "23.50 *\r\a" "\r\a" "24.00 *\r\a" "\r\a" "23.50 *\r\a"
[24,] " \r\a" "\r\a" " \r\a" "\r\a" " \r\a" "\r\a"
[25,] "26.25 *\r\a" "\r\a" "26.75 *\r\a" "\r\a" "26.25 *\r\a" NA
[26,] "26.25 *\r\a" "\r\a" "26.75 *\r\a" "\r\a" "26.25 *\r\a" NA
[27,] "26.00 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "26.00 *\r\a" NA
[28,] "28.75 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "27.50 *\r\a" "\r\a"
[29,] "25.50 *\r\a" "\r\a" "23.75 *\r\a" "\r\a" "24.25 *\r\a" "\r\a"
[30,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[31,] "27.10 *\r\a" "\r\a" "25.35 *\r\a" "\r\a" "25.85 *\r\a" "\r\a"
[32,] "26.75 *\r\a" "\r\a" "25.00 *\r\a" "\r\a" "25.50 *\r\a" "\r\a"
[33,] "27.25 *\r\a" "\r\a" "25.50 *\r\a" "\r\a" "26.00 *\r\a" "\r\a"
[34,] "30.75 *\r\a" "\r\a" "29.00 *\r\a" "\r\a" "29.50 *\r\a" "\r\a"
[35,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[36,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a"
[37,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[38,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a"
[39,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[40,] "27.50 *\r\a" "\r\a" "25.75 *\r\a" "\r\a" "26.25 *\r\a" "\r\a"
[41,] "27.50 *\r\a" "\r\a" "25.75 *\r\a" "\r\a" "26.25 *\r\a" "\r\a"
[42,] "27.25 *\r\a" "\r\a" "25.50 *\r\a" "\r\a" "26.00 *\r\a" "\r\a"
[43,] "27.25 *\r\a" "\r\a" "25.50 *\r\a" "\r\a" "26.00 *\r\a" "\r\a"
[44,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[45,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a"
[46,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[47,] "27.75 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "26.50 *\r\a" "\r\a"
[48,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[49,] "26.75 *\r\a" "\r\a" "25.00 *\r\a" "\r\a" "25.50 *\r\a" "\r\a"
[50,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[51,] "26.55 *\r\a" "\r\a" "24.80 *\r\a" "\r\a" "25.30 *\r\a" "\r\a"
[52,] "23.90 *\r\a" "\r\a" "22.15 *\r\a" "\r\a" "22.65 *\r\a" "\r\a"
[53,] "\r\a" " \r\a" "\r\a" " \r\a" "\r\a" " \r\a"
[54,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00 *\r\a" "\r\a"
[55,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00\r\a" "\r\a"
[56,] "28.25 *\r\a" "\r\a" "26.50 *\r\a" "\r\a" "27.00\r\a" "\r\a"
[57,] "27.75 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "26.50\r\a" "\r\a"
[58,] "27.75 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "26.50\r\a" "\r\a"
[59,] "\r\a" " \r\a" "\r\a" "\r\a" "\r\a" " \r\a"
[60,] "28.75 *\r\a" "\r\a" "27.00 *\r\a" "\r\a" "27.50\r\a" "\r\a"
[61,] "27.25 *\r\a" "\r\a" "25.50 *\r\a" "\r\a" "26.00\r\a" "\r\a"
[62,] "27.75 *\r\a" "\r\a" "26.00 *\r\a" "\r\a" "26.50\r\a" "\r\a"
[63,] "24.25 *\r\a" "\r\a" "22.50 *\r\a" "\r\a" "23.00\r\a" "\r\a"
[64,] "\r\a" "22.50 *\r\a" "\r\a" "23.00\r\a" "\r\a" "22.50\r\a"
[,12]
[1,] NA
[2,] NA
[3,] NA
[4,] NA
[5,] NA
[6,] NA
[7,] NA
[8,] NA
[9,] NA
[10,] NA
[11,] NA
[12,] NA
[13,] NA
[14,] NA
[15,] NA
[16,] NA
[17,] NA
[18,] NA
[19,] NA
[20,] NA
[21,] NA
[22,] NA
[23,] NA
[24,] NA
[25,] NA
[26,] NA
[27,] NA
[28,] "27.00 *\r\a"
[29,] "23.75 *\r\a"
[30,] "\r\a"
[31,] "25.35 *\r\a"
[32,] "25.00 *\r\a"
[33,] "25.50 *\r\a"
[34,] "29.00 *\r\a"
[35,] "\r\a"
[36,] "26.50 *\r\a"
[37,] "\r\a"
[38,] "26.50 *\r\a"
[39,] "\r\a"
[40,] "25.75 *\r\a"
[41,] "25.75 *\r\a"
[42,] "25.50 *\r\a"
[43,] "25.50 *\r\a"
[44,] "\r\a"
[45,] "26.50 *\r\a"
[46,] "\r\a"
[47,] "26.00 *\r\a"
[48,] "\r\a"
[49,] "25.00 *\r\a"
[50,] "\r\a"
[51,] "24.80 *\r\a"
[52,] "22.15 *\r\a"
[53,] "\r\a"
[54,] "26.50 *\r\a"
[55,] "26.50\r\a"
[56,] "26.50\r\a"
[57,] "26.00\r\a"
[58,] "26.00\r\a"
[59,] "\r\a"
[60,] "27.00\r\a"
[61,] "25.50\r\a"
[62,] "26.00\r\a"
[63,] "22.50\r\a"
[64,] NA
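The cells above still carry Word's end-of-cell markers ("\r\a"), tabs, dot leaders and the trailing asterisk. A possible cleanup step in base R is sketched below. The helper name clean_cell is hypothetical and not part of the code above, and the sample cells are copied from the printed output rather than read from Word.

```r
# Hypothetical helper: strip Word cell markers, dot leaders and asterisks
clean_cell <- function(x) {
  x <- gsub("[\r\a\t]", " ", x)         # end-of-cell markers and tabs
  x <- gsub("\\.{2,}|\u2026", " ", x)   # dot leaders (runs of periods or an ellipsis)
  x <- gsub("\\*", "", x)               # trailing asterisk flag
  trimws(gsub(" +", " ", x))            # collapse and trim whitespace
}

# A few cells copied from the output above
cells <- c(
  " West Texas Intermediate - Area #1..............................\t30.75 *\r\a",
  "26.00 *\r\a",
  "\r\a",
  "NEW MEXICO\t \r\a",
  NA,
  "27.00\r\a"
)

cleaned <- clean_cell(cells)

# Keep the trailing price when there is one; everything else becomes NA
price <- suppressWarnings(
  as.numeric(sub("^.*?([0-9]+\\.[0-9]{2})$", "\\1", cleaned, perl = TRUE))
)

cleaned
price
```

To run it over a whole table, something like `apply(list_Table[[1]], c(1, 2), clean_cell)` should return a cleaned character matrix of the same shape. Note that the prices are staggered across alternating columns in the Word output, so the columns would still need to be merged afterwards.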
Upvotes: 1