Reputation: 261
Can someone help me read a PDF file that contains some tables? I want to extract the data from the tables and write it to a CSV file.
Thanks a lot
Upvotes: 25
Views: 57317
Reputation: 7260
You'll find a well-described step-by-step guide from the University of Virginia at Reading PDF files into R for text mining, which uses readPDF from the tm package. Some information I extracted is below.
Please follow the installation notes described in the link above.
With that done, you’re ready to use readPDF to create your function to read in PDF files. You can name the function whatever you like, e.g., Rpdf.
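library(tm)  # readPDF() is provided by the tm package, not pdftools
# engine = "xpdf" so the "-layout" option is actually passed to pdftotext
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))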
The readPDF function has a control argument that we use to pass options to our PDF extraction engine. This has to be in the form of a list, so we wrap our options in the list function. There are two control parameters for the xpdf engine: info and text. info passes parameters to pdfinfo.exe and text passes parameters to pdftotext.exe. We only pass one parameter setting to pdftotext: “-layout”. This tells pdftotext.exe to maintain (as best as possible) the original physical layout of the text.
Using the Rpdf function, we can proceed to read the text of the opinions. What we want to do is convert the PDF files to text and store them in a corpus, which is basically a database for text. We can do all that with the following code:
files <- list.files(pattern = "pdf$")  # PDF files in the working directory
opinions <- Corpus(URISource(files), readerControl = list(reader = Rpdf))
Upvotes: 9
Reputation: 17689
I realize this question is older, but I thought reproducible examples might not hurt:
library(pdftools)
pdftools::pdf_text(pdf = "http://arxiv.org/pdf/1403.2805.pdf")
Offline version:
pdf(file = "tmp.pdf")
plot(1, main = "mytext")
dev.off()
pdftools::pdf_text(pdf = "tmp.pdf")
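Since the question asks for a CSV, here is a minimal sketch of one way to get from the text of a page to a table. pdf_text returns one string per page, so the sketch works on a single string. It assumes the columns are separated by two or more spaces and that the first non-empty line holds the headers, which will not be true for every PDF. The names page_to_table, sample_page and csv_path are made up for this example, and sample_page only stands in for one element of the pdf_text output.

```r
# Split one page of layout-preserved text into a data frame.
# Assumption: columns are separated by runs of two or more spaces,
# and the first non-empty line contains the column headers.
page_to_table <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop empty lines
  cells <- strsplit(lines, " {2,}")        # split on 2+ spaces
  n_col <- length(cells[[1]])
  # Keep only lines with as many cells as the header line
  cells <- cells[lengths(cells) == n_col]
  body <- do.call(rbind, cells[-1])        # character matrix of data rows
  tbl <- as.data.frame(body, stringsAsFactors = FALSE)
  names(tbl) <- cells[[1]]
  tbl
}

# Hypothetical stand-in for one page returned by pdftools::pdf_text()
sample_page <- paste(
  "Name      Year   Score",
  "Alice     2019   3.5",
  "Bob       2020   4.0",
  sep = "\n"
)

tbl <- page_to_table(sample_page)

# Write the result to a CSV file (a temporary path is used here)
csv_path <- file.path(tempdir(), "table.csv")
write.csv(tbl, csv_path, row.names = FALSE)
```

With a real file you would pass a single page, e.g. the first element of the pdf_text result, instead of sample_page. All columns come back as character, so convert numeric ones with as.numeric where needed. Tables with cells that contain wide gaps or wrap over several lines need more careful handling than this.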
I come back to this question from time to time, and even though the current answer is great, I always hope to find reproducible code, so I thought I would add it. It can be removed if not needed.
Upvotes: 30