user13232877

Reputation: 205

Is there a way to extract a tidy table from a PDF with R?

I need code that extracts a table from a PDF automatically in R.

I searched the web and found the tabulizer package.

I used:

extract_tables(f2, pages = 25, guess = TRUE, encoding = 'UTF-8', method = "stream")  # f2 is the PDF file name

I tried every method type, but the result is not tidy.

Some columns are merged and there are many blank cells, as you can see in the image.

I could fix the data by hand, but the goal is to automate the extraction, so I need a general method. The PDF files are not consistently laid out: some tables are tidy, with every related line matching perfectly, but others are not. As you can see in my result image, column 4 holds numbers from more than one column mixed together, while in the other columns the numbers match one to one. I want the columns to come out tidy automatically, like the table in the PDF.

Is there a package or method that makes the extracted table tidy?

my Code result

table in PDF

Upvotes: 0

Views: 1343

Answers (1)

Emmanuel Hamel

Reputation: 2233

With the following code, I was able to extract the numbers from the table. First, I converted the image to a PDF file. Then I converted the PDF file to a Word file. Finally, I extracted the tables from the Word file. This solution only works on Windows, because it drives Microsoft Word through RDCOMClient.

library(RDCOMClient)  # COM automation of Microsoft Word (Windows only)
library(magick)       # image reading and cropping

path_PDF <- "D:\\image_Stackoverflow79.pdf"
path_PNG <- "D:\\Dropbox\\Reponses_Stackoverflow\\image_Stackoverflow79.png"
path_Word <- "D:\\image_Stackoverflow79.docx"

# Crop the screenshot to the table area and draw it onto a one-page PDF.
pdf(path_PDF, height = 8, width = 6)
im <- image_read(path_PNG)
im <- image_crop(im, geometry = geometry_area(width = 510, height = 310, x_off = 100, y_off = 110))
plot(im)
dev.off()

# Start Word and suppress its dialogs.
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

# Open the PDF in Word, which converts it to an editable document.
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

# Save the converted document as .docx.
doc$SaveAs2(path_Word)

# Read the first table of the document cell by cell into a character matrix.
nb_Row <- doc$tables(1)$Rows()$Count()
nb_Col <- doc$tables(1)$Columns()$Count()
mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

for(i in 1 : nb_Row)
{
  for(j in 1 : nb_Col)
  {
    # A cell that cannot be read raises an error and is left as NA.
    mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(), error = function(e) NA)
  }
}

mat_Temp

      [,1]   [,2]        [,3]         [,4]         [,5]        [,6]        [,7]        [,8]  
 [1,] "\r\a" "\r\a"      "\r\a"       "\r\a"       "\r\a"      "\r\a"      "\r\a"      "\r\a"
 [2,] "\r\a" "0.46\r\a"  "0.46\r\a"   "0.46\r\a"   "0.46\r\a"  "0.46\r\a"  "0.46\r\a"  "\r\a"
 [3,] "\r\a" "1.00\r\a"  "0.00\r\a"   "0.98\r\a"   "0.03\r\a"  "0.95\r\a"  "0.85\r\a"  NA    
 [4,] "\r\a" "0.025\r\a" "0.025\r\a"  "0.025\r\a"  "0.025\r\a" "0.025\r\a" "0.025\r\a" NA    
 [5,] "\r\a" "0.005\r\a" "0.005\r\a"  "0.005\r\a"  "0.005\r\a" "0.005\r\a" "0.005\r\a" NA    
 [6,] "\r\a" "1.49\r\a"  "0.49\r\a"   "1.47\r\a"   "0.52\r\a"  "1.44\r\a"  "1.34\r\a"  "\r\a"
 [7,] "\r\a" "0.002\r\a" "0.002\r\a"  "0.002\r\a"  "0.002\r\a" "0.002\r\a" "0.002\r\a" "\r\a"
 [8,] "\r\a" "1.492\r\a" "0.492\r\a"  "1472\r\a"   "0.522\r\a" "1.442\r\a" "1.342\r\a" "\r\a"
 [9,] "\r\a" "1.59\r\a"  "\r\a"       "1.22\r\a"   "\r\a"      "\r\a"      "\r\a"      "\r\a"
[10,] "\r\a" "1.493\r\a" "0.493\r\a"  "1473\r\a"   "0.523\r\a" "1.443\r\a" "1.343\r\a" "\r\a"
[11,] "\r\a" "0.107\r\a" "o. 108\r\a" "o. 105\r\a" "0.108\r\a" "0.106\r\a" "0.104\r\a" "\r\a"
[12,] "\r\a" "\r\a"      "\r\a"       NA           NA          NA          NA          NA         

With this approach, the numbers appear to end up in the correct columns.
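The cell texts returned by Word still end in "\r\a" (the end-of-cell marker), and a few values carry recognition errors such as "o. 108". A minimal clean-up sketch follows, using a small sample matrix with a few cells copied from the output above. The names `mat_clean` and `mat_num` and the rule "a leading letter o means 0" are assumptions for illustration, not part of the original code.

```r
# Sample: a few cells copied from the printed mat_Temp above.
mat_Temp <- matrix(
  c("\r\a", "0.46\r\a",  "0.46\r\a",
    "\r\a", "1.00\r\a",  NA,
    "\r\a", "0.107\r\a", "o. 108\r\a"),
  nrow = 3, byrow = TRUE
)

# Strip the carriage return and bell characters that end every Word cell.
# Assigning into mat_clean[] keeps the matrix dimensions.
mat_clean <- mat_Temp
mat_clean[] <- trimws(gsub("[\r\a]", "", mat_Temp))

# Assumed repair rule: remove inner spaces and read a leading "o"/"O" as zero.
mat_clean[] <- gsub("\\s+", "", mat_clean)
mat_clean[] <- sub("^[oO]", "0", mat_clean)

# Convert to numeric; empty or unreadable cells become NA.
mat_num <- matrix(suppressWarnings(as.numeric(mat_clean)),
                  nrow = nrow(mat_clean))

mat_num
```

Values that lost their decimal point during recognition, such as "1472" in the output above, cannot be repaired by a rule like this and would still need a manual or range-based check.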

Upvotes: 0
