Reputation: 47
I have a very basic question because I'm an absolute beginner. I've tried to find help online and read different tutorials and handbooks, but can't find the answer.
My project is very simple. I have dozens of PDFs (stored in a folder) that I want to analyse for their contents (unsupervised learning). The ultimate goal is a topic analysis. Now here's the problem: every guide I can find jumps right into pre-processing the texts without covering the first steps of loading the files into R and defining the corpus.
So, basically, I want to break all these PDFs down into a data frame for analysis, but I'm missing the first step of loading them into R.
Any help would be greatly appreciated.
Upvotes: 2
Views: 3237
Reputation: 23598
There are several ways to do this, but if the goal is a corpus, there is a simple one. It requires the pdftools package (install.packages("pdftools")), because that is the engine used to read the PDFs. After that, it is just a matter of using the tm package to read everything into a corpus.
library(tm)
directory <- getwd() # change this to the directory where the files are located
# read the PDFs with readPDF; the default engine is pdftools, see ?readPDF for more info
# pattern is a regular expression, so escape the dot and anchor it to the end of the file name
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$", ignore.case = TRUE),
                     readerControl = list(reader = readPDF))
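Since the question asks for a data frame, here is a minimal sketch of one way to flatten the corpus into one row per document. The helper name corpus_to_df and the column names doc_id and text are my own choices for illustration, not part of tm; the sketch assumes each document's content is a character vector, which is what readPDF normally produces.

```r
library(tm)

# Hypothetical helper (not part of tm): one row per document in the corpus.
corpus_to_df <- function(corpus) {
  docs <- content(corpus)  # for a VCorpus this is the list of documents
  data.frame(
    # document id stored by tm (the file name when read via DirSource)
    doc_id = vapply(docs, function(doc) as.character(meta(doc, "id")),
                    character(1), USE.NAMES = FALSE),
    # a document's content may hold several strings (e.g. one per page),
    # so join them into a single string
    text = vapply(docs, function(doc) paste(content(doc), collapse = "\n"),
                  character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Usage with the corpus built above:
# pdf_df <- corpus_to_df(my_corpus)
```

Keeping the text in a plain character column makes it easy to hand the result to whatever topic modelling package you end up using; if you stay within tm, you can skip this step and build a DocumentTermMatrix from my_corpus directly.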
Upvotes: 2