Bammers

Reputation: 47

How do I convert multiple PDFs into a corpus for text analysis in R?

I have a very basic question because I'm an absolute beginner. I've tried to find help online and have read different tutorials and handbooks, but I can't find the answer.

My project is very simple. I have dozens of PDFs (stored in a folder) that I want to analyse for their content (unsupervised learning). The ultimate goal is a topic analysis. Here's the problem: every guide I can find jumps straight into pre-processing of these texts without covering the first steps of loading the files into R and defining the corpus.

So, basically, I want to break all these PDFs down into a data frame for analysis, but I'm missing the first step of loading them into R.

Any help would be greatly appreciated.

Upvotes: 2

Views: 3237

Answers (1)

phiver

Reputation: 23598

There are multiple ways to do this, but if you want to get the files into a corpus there is a simple one. It requires that the pdftools package is installed (install.packages("pdftools")), as that is the engine used to read the PDFs. Then it is just a question of using the tm package to read everything into a corpus.

library(tm)

directory <- getwd() # change this to the directory where the PDF files are located

# read the PDFs with readPDF; the default engine is pdftools, see ?readPDF for more info
# the pattern is a regular expression, so escape the dot and anchor it at the
# end of the file name, otherwise "." matches any character
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
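Since the question asks for a data frame, the corpus can then be flattened into one. A minimal sketch, assuming the corpus was built as above (the names my_df, doc_id, and text are my own choices, not part of tm's API; each corpus element stores one PDF as a character vector with one string per page, so the pages are collapsed into a single string):

```r
library(tm)

# one row per PDF: the document ID (file name) and its full text
my_df <- data.frame(
  doc_id = names(my_corpus),
  text   = sapply(my_corpus, function(doc) paste(content(doc), collapse = " ")),
  stringsAsFactors = FALSE
)
```

From here the text column can be fed into the usual pre-processing and topic-modelling steps that the tutorials describe (e.g. building a document-term matrix with DocumentTermMatrix before fitting a topic model).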

Upvotes: 2