Reputation: 1
Hi everyone,
I am desperately trying to convert 29,677 .txt files into a single .csv file, with one column for the file names and one column for the text in the files. But when I run the code on a sample of 10 .txt files, I get the following error message:
Error in data.frame(filename = basename(myfiles), fulltext = mytxts1lines) : arguments imply differing number of rows: 10, 150194
Any help would be greatly appreciated as I'm going insane...
I tried to run the following function to make it happen:

    txt2csv <- function(my_dir, mycsvfilename) {
      starting_dir <- getwd()
      myfiles <- list.files(mydir, full.names = TRUE, pattern = "*.txt")
      mytxts <- lapply(myfiles, readLines)
      mytxts1lines <- unlist(mytxts)
      mytxtsdf <- data.frame(filename = basename(myfiles),
                             fulltext = mytxts1lines)
      setwd(mydir)
      write.table(mytxtsdf, file = paste0(mycsvfilename, ".csv"), sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
      message(paste0("newspapers_csv_file", paste0(newspapers_csv_file, ".csv"), "C:\\Users\\[...]", getwd()))
      setwd(starting_dir)
    }

Upvotes: 0
Views: 21
Reputation: 20522
When you say `filename = basename(myfiles)`, you are asking R to create a column called `filename`, of length `length(myfiles)`, which is 10. Then when you create the next column of 150194 rows, R cannot reconcile the two.
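The mismatch can be reproduced on a small scale. This is a minimal sketch with made-up stand-ins (`filenames` and `textlines` are illustrative names, not from the question): `data.frame()` only recycles a shorter column when the longer length is an exact multiple of it, and neither 7 and 3 here nor 150194 and 10 in the question satisfy that.

```r
# Hypothetical stand-ins for the real data: 3 file names, 7 lines of text.
filenames <- c("a.txt", "b.txt", "c.txt")
textlines <- letters[1:7]

# 7 is not a multiple of 3, so data.frame() cannot recycle the shorter
# column and signals an error; tryCatch() captures the message as a string.
result <- tryCatch(
  data.frame(filename = filenames, fulltext = textlines),
  error = function(e) conditionMessage(e)
)
print(result)
```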
What you want to do is create a `filename` vector of the same length as `mytxts1lines` (one entry per line of text) and use that as the `filename` column. We can do that with the `rep()` function, using the `lengths()` function to define how many times to repeat each file name.
Here is a reproducible example:

    mytxts <- list(
      letters[1:5],
      letters[1:10],
      letters[1:3]
    )
    myfiles <- c("file1", "file2", "file3")
    mytxts1lines <- unlist(mytxts)
    myfiles1lines <- rep(myfiles, lengths(mytxts))
    data.frame(
      filename = myfiles1lines,
      fulltext = mytxts1lines
    )
    #    filename fulltext
    # 1     file1        a
    # 2     file1        b
    # 3     file1        c
    # 4     file1        d
    # 5     file1        e
    # 6     file2        a
    # 7     file2        b
    # 8     file2        c
    # 9     file2        d
    # 10    file2        e
    # 11    file2        f
    # 12    file2        g
    # 13    file2        h
    # 14    file2        i
    # 15    file2        j
    # 16    file3        a
    # 17    file3        b
    # 18    file3        c

In your case, you will want the argument to `rep()` to be `basename(myfiles)`.

    txt2csv <- function(mydir, mycsvfilename) { # parameter renamed from my_dir to match the body
      starting_dir <- getwd()
      # pattern is a regular expression, so match a literal ".txt" at the end of the name
      myfiles <- list.files(mydir, full.names = TRUE, pattern = "\\.txt$")
      mytxts <- lapply(myfiles, readLines)
      mytxts1lines <- unlist(mytxts)
      myfiles1lines <- rep(basename(myfiles), lengths(mytxts)) # this line is new
      mytxtsdf <- data.frame(filename = myfiles1lines, # this line is changed
                             fulltext = mytxts1lines)
      setwd(mydir)
      write.table(mytxtsdf, file = paste0(mycsvfilename, ".csv"), sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
      # newspapers_csv_file was never defined; use the mycsvfilename argument instead
      message(paste0("newspapers_csv_file", paste0(mycsvfilename, ".csv"), "C:\\Users\\[...]", getwd()))
      setwd(starting_dir)
    }
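
The function above produces one row per line of text. If the goal is instead one row per file, with the whole text in a single cell, something along these lines might work. This is only a sketch under that assumption: `txt2csv_one_row_per_file`, `in_dir` and `out_file` are illustrative names, and joining lines with a space is one choice among several.

```r
# Assumption: one row per file is wanted, with each file's text in one cell.
txt2csv_one_row_per_file <- function(in_dir, out_file) {
  files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
  # Read each file and join its lines into a single string
  texts <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = " "),
    character(1)
  )
  df <- data.frame(
    filename = basename(files),
    fulltext = unname(texts),
    stringsAsFactors = FALSE
  )
  # write.csv() quotes character fields by default, so commas inside
  # the text do not split a cell into extra columns
  write.csv(df, file = out_file, row.names = FALSE)
  invisible(df)
}

# Usage with a throwaway directory and two small files
demo_dir <- file.path(tempdir(), "txt2csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second, with a comma"), file.path(demo_dir, "one.txt"))
writeLines("only line", file.path(demo_dir, "two.txt"))
out <- file.path(tempdir(), "demo.csv")
demo_df <- txt2csv_one_row_per_file(demo_dir, out)
```

Passing full paths avoids the `setwd()` calls, and leaving quoting on matters here because newspaper text will almost certainly contain commas.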
Upvotes: 1