Reputation: 251
I have a batch of text files that I need to read into R to do text mining.
So far, I have tried read.table, readLines, lapply, and mcsv_r from the qdap package, to no avail. I have also tried to write a loop to read the files, but I have to specify the name of the file, which changes in every iteration.
Here is what I have tried:
# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"
# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")
for (i in 1:length(speeches))
{
  text_df <- do.call(rbind, lapply(speeches[i], read.csv))
}
Moreover, I have tried the following:
library(data.table)
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )
And it gives me this error, even though inaugAbrahamLincoln-1.csv clearly exists in the folder:
files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) :
File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
>
But that approach only works on .csv files, not on .txt files.
Is there a simpler way to do text mining from multiple source files? If so, how?
Thanks
Upvotes: 1
Views: 5714
Reputation: 109844
I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:
textreadr::read_dir("../data/InauguralSpeeches/")
Your example is not reproducible, so I create one below (please make your examples reproducible in the future).
library(textreadr)
## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')
## the read in of a directory
read_dir('delete_me')
The output below is a tibble, with each document registered in the document column and one row per line of each document. Depending on what's in the .csv files, this may not be fine-grained enough (see the sketch after the output).
## document content
## 1 0_9 Bromwell High is a cartoon comedy. It ra
## 2 00_00 test
## 3 00_00
## 4 00_00 testing
## 5 00_00
## 6 00_00 tester
## 7 1_7 If you like adult comedy cartoons, like
## 8 10_9 I'm a male, not given to women's movies,
## 9 11_9 Liked Stanley & Iris very much. Acting w
## 10 12_9 Liked Stanley & Iris very much. Acting w
## .. ... ...
## 141 mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
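If you need the .csv files parsed into real columns rather than one row of text per line, a base R sketch (my addition, not textreadr output; it reuses the demo directory created above) reads them separately:
## read the demo .csv files into a named list of data.frames
csvs <- dir('delete_me', pattern = '\\.csv$', full.names = TRUE)
csv_list <- lapply(csvs, read.csv)
names(csv_list) <- basename(csvs)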
Upvotes: 3
Reputation: 20302
Here is one way to do it.
library(data.table)

setwd("C:/Users/Excel/Desktop/CSV Files/")
WD <- "C:/Users/Excel/Desktop/CSV Files/"

# start with an empty table that has only the headers
data <- data.table(read.csv(text = "CashFlow,Cusip,Period"))

# list only the .csv files
csv.list <- list.files(WD, pattern = "\\.csv$")
k <- 1
for (i in csv.list) {
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  # report progress every 100 files
  if (k %% 100 == 0)
    print(k / length(csv.list))
  k <- k + 1
}
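If there are many files, a more idiomatic data.table variant (a sketch of my own, assuming the same folder of .csv files) reads everything with fread and stacks it once, instead of growing the table inside the loop:
# full paths so fread can locate each file
files <- list.files(WD, pattern = "\\.csv$", full.names = TRUE)
# read each file and stack the results in a single step
data <- rbindlist(lapply(files, fread))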
Upvotes: 1
Reputation: 1221
Here is code that will read all the *.csv files in a directory into a single data.frame:
dir <- '~/Desktop/testcsv/'
files <- list.files(dir,pattern = '*.csv', full.names = TRUE)
data <- lapply(files, read.csv)
df <- do.call(rbind, data)
Notice that I added the argument full.names = TRUE. This makes list.files return full paths instead of bare file names; without it, read.csv looks for "inaugAbrahamLincoln-1.csv" in the working directory, which is why you're getting an error even though the file exists in the folder.
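The same pattern extends to .txt files if you swap read.csv for readLines; a minimal sketch (the one-row-per-line layout of the combined data.frame is my assumption):
txts <- list.files(dir, pattern = '\\.txt$', full.names = TRUE)
# one character vector of lines per file
texts <- lapply(txts, readLines)
# stack into a single data.frame, tagging each line with its source file
df <- data.frame(file = rep(basename(txts), lengths(texts)),
                 text = unlist(texts))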
Upvotes: 2