user101998

Reputation: 251

Read multiple text files into R for text mining purposes

I have a batch of text files that I need to read into R to do text mining.

So far I have tried read.table, read.line, lapply, and mcsv_r from the qdap package, to no avail. I also tried writing a loop to read the files, but I have to specify the file name, which changes on every iteration.

Here is what I have tried:

# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"

# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")

for(i in 1:length(speeches))
  {

    text_df <- do.call(rbind,lapply(speeches[i],read.csv))

}

Moreover, I have tried the following:

library(data.table)  
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )

It gives me this error even though inaugAbrahamLincoln-1.csv clearly exists in the folder:

> files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) : 
  File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
> 

But it only works on .csv files, not on .txt files.

Is there a simpler way to do text mining from multiple source files? If so, how?

Thanks

Upvotes: 1

Views: 5714

Answers (3)

Tyler Rinker

Reputation: 109844

I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:

textreadr::read_dir("../data/InauguralSpeeches/")

Your example is not reproducible, so I create one below (please make your example reproducible in the future).

library(textreadr)

## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')

## the read in of a directory
read_dir('delete_me') 

Output

The output below is a tibble in which the document column identifies each document and every line of a document gets its own row. Depending on what's in the csv files, this may not be fine-grained enough.

##    document                                  content
## 1       0_9 Bromwell High is a cartoon comedy. It ra
## 2     00_00                                     test
## 3     00_00                                         
## 4     00_00                                  testing
## 5     00_00                                         
## 6     00_00                                   tester
## 7       1_7 If you like adult comedy cartoons, like 
## 8      10_9 I'm a male, not given to women's movies,
## 9      11_9 Liked Stanley & Iris very much. Acting w
## 10     12_9 Liked Stanley & Iris very much. Acting w
## ..      ...                                      ... 
## 141   mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142   mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143   mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
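
If one row per document is more convenient for text mining than one row per line, the lines can be collapsed afterwards. The following is only a sketch in base R: lines_df is a made-up stand-in with the same document and content columns as the output above, not the actual result of read_dir.

```r
# Toy stand-in for the read_dir() output: one row per line of each document
lines_df <- data.frame(
  document = c("00_00", "00_00", "00_00", "1_7"),
  content  = c("test", "testing", "tester", "If you like adult comedy cartoons"),
  stringsAsFactors = FALSE
)

# Collapse to one row per document, joining that document's lines with spaces
docs_df <- aggregate(content ~ document, data = lines_df,
                     FUN = paste, collapse = " ")
```

Here docs_df has one row for each distinct value of document, with the joined text in content.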

Upvotes: 3

ASH

Reputation: 20302

Here is one way to do it.

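library(data.table)

# Folder containing the CSV files; also make it the working directory
WD <- "C:/Users/Excel/Desktop/CSV Files/"
setwd(WD)

# Start from an empty table that carries only the expected headers
data <- data.table(read.csv(text = "CashFlow,Cusip,Period"))

# Only pick up files whose names end in .csv
csv.list <- list.files(WD, pattern = "\\.csv$")
k <- 1

for (i in csv.list) {
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))

  # Every 100 files, print the fraction of files processed so far
  if (k %% 100 == 0)
    print(k / length(csv.list))

  k <- k + 1
}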
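
A possible variation on the loop above, sketched under the assumption that all files share the same columns: read every file with fread and bind once with rbindlist, which avoids growing the table on each iteration. The folder and the two small files here are made up for illustration.

```r
library(data.table)

# Hypothetical folder with two small CSV files that share the same columns
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(CashFlow = 1:2, Cusip = c("A", "B"), Period = 1L),
          file.path(csv_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(CashFlow = 3L, Cusip = "C", Period = 2L),
          file.path(csv_dir, "two.csv"), row.names = FALSE)

# Full paths avoid the need for setwd()
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then bind the list in a single step
tables <- lapply(csv_files, fread)
combined <- rbindlist(tables, use.names = TRUE)
```

With use.names = TRUE the columns are matched by name rather than by position, which is a little safer if the column order differs between files.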

Upvotes: 1

calico_

Reputation: 1221

Here is code that will read all the *.csv files in a directory to a single data.frame:

dir <- '~/Desktop/testcsv/'
files <- list.files(dir, pattern = '\\.csv$', full.names = TRUE)
data <- lapply(files, read.csv)
df <- do.call(rbind, data)

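Notice that I added the argument full.names = TRUE. This prepends the directory path to each file name. Without it, list.files returns bare file names, which R then looks for in the current working directory; that is why you're getting an error for "inaugAbrahamLincoln-1.csv" even though it exists in the folder.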
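
The same pattern carries over to the .txt files in the question. Below is a minimal sketch in base R, assuming each file is plain text and one row per document is wanted. The helper name read_text_dir is hypothetical, and a temporary folder with made-up files stands in for the real data.

```r
# Hypothetical helper: read every .txt file in a folder into a data.frame
# with one row per document
read_text_dir <- function(folder) {
  # full.names = TRUE keeps the folder in each path, so the files are found
  # regardless of the current working directory
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

  # Read each file line by line and collapse the lines into a single string
  texts <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )

  data.frame(
    document = tools::file_path_sans_ext(basename(files)),
    text = unname(texts),
    stringsAsFactors = FALSE
  )
}

# Usage with throwaway files in a temporary folder
demo_dir <- file.path(tempdir(), "speeches_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

speeches_df <- read_text_dir(demo_dir)
```

For the question's data, the call would be read_text_dir("../data/InauguralSpeeches/"), provided that relative path is valid from the current working directory.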

Upvotes: 2
