Reputation: 207
I need to break a large file (14 gigabytes) into smaller files. The file is plain text, the separator is ";" and I know it has 70 columns (string, double). I would like to read 1 million lines at a time and save them in different files: file1, file2, ..., fileN.
After the help of @MKR I arrived at the code below, but the process was very slow. I tried using fread, but it was not possible.
How can I optimize this code?
NEW CODE
# db holds the path to the 14 GB input file
chunkSize <- 500000
conex <- file(description = db, open = "r")

# first chunk: read with the header so the 70 column names can be captured
dataChunk <- read.table(conex, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ";")
db_colnames <- names(dataChunk)

index <- 0
counter <- 0
total <- 0

repeat {
  total <- total + sum(dataChunk$total)
  counter <- counter + nrow(dataChunk)
  write.table(dataChunk, file = paste0("MY_FILE_new", index), sep = ";", row.names = FALSE)

  if (nrow(dataChunk) != chunkSize) {
    print('linesok')
    break
  }

  index <- index + 1
  print(paste('lines', index * chunkSize))

  # next chunk: no header, reuse the column names captured above
  dataChunk <- read.table(conex, nrows = chunkSize, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)
}
close(conex)
Upvotes: 5
Views: 3913
Reputation: 2213
You can consider the following approach, which splits your file into chunks of 1,000,000 lines:
library(readr)

dir <- "D:\\"
file <- "my_File.txt"
file_Path <- paste0(dir, file)

con <- file(file_Path)
open(con, "rb")

counter <- 0
pagesize <- 1000000

repeat {
  # read the next 'pagesize' lines from the open connection
  page <- readr::read_lines(con, n_max = pagesize, num_threads = 10)
  # save the chunk as 0_.RData, 1_.RData, ...
  save(page, file = paste0(counter, "_.RData"))
  counter <- counter + 1
  # fewer lines than 'pagesize' means the end of the file was reached
  if (length(page) < pagesize) {
    break
  }
}
close(con)
The above approach works when the file does not fit in RAM. For example, it worked for a 130 GB file on a computer with 64 GB of RAM. Also, the read_lines function of the readr package uses multiple threads, which can provide a speed-up.
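If you later want the chunks back as plain-text pieces rather than .RData files, here is a minimal sketch of the reverse step, assuming the counter-based file names produced above (the output names part_0.txt, part_1.txt, ... are only illustrative):

# reload each saved chunk and write it back out as a plain-text file
i <- 0
while (file.exists(paste0(i, "_.RData"))) {
  load(paste0(i, "_.RData"))                     # restores the character vector 'page'
  writeLines(page, paste0("part_", i, ".txt"))   # illustrative output names
  i <- i + 1
}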
Upvotes: 1
Reputation: 20085
You are perfectly on the right track to a solution.
The approach should be:
1. Read 1 million lines
2. Write them to a new file
3. Read the next 1 million lines
4. Write them to another new file
Let's convert the above logic into a loop, along the lines of the OP's attempt:
# 'con' is an open connection to the big file and 'db_colnames' holds the
# 70 column names (see the setup sketch below)
index <- 0
counter <- 0
total <- 0
chunks <- 500000

repeat {
  dataChunk <- read.table(con, nrows = chunks, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)

  # do processing on dataChunk (i.e. adding header, converting data types)

  # Create a new file name and write to it. You can have your own logic for file names
  write.table(dataChunk, file = paste0("file", index))

  # check if the end of the file has been reached and break out of repeat
  if (nrow(dataChunk) < chunks) {
    break
  }

  # increment the index to read the next chunk
  index <- index + 1
}
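For completeness, a minimal sketch of the setup the snippet assumes, i.e. opening the connection once and capturing the column names from the header line (the file name my_File.txt is only illustrative):

db <- "my_File.txt"                        # illustrative path to the 14 GB file
con <- file(description = db, open = "r")

# read just the header line; the connection then sits at the first data row,
# so each read.table() call in the loop resumes where the previous one stopped
header_line <- readLines(con, n = 1)
db_colnames <- strsplit(header_line, split = ";", fixed = TRUE)[[1]]

Opening the connection once, outside the loop, is what lets read.table pick up where it left off; reopening it on every iteration would restart from the top of the file.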
Edited: Added another option that reads the file using data.table::fread, as requested by the OP.
library(data.table)

index <- 0
counter <- 0
total <- 0
chunks <- 1000000
fileName <- "myfile"

repeat {
  # With fread the file is opened in each iteration; 'skip' jumps past the
  # rows that previous iterations have already read
  dataChunk <- fread(input = fileName, nrows = chunks, header = FALSE, fill = TRUE,
                     skip = chunks * index, sep = ";", col.names = db_colnames)

  # do processing on dataChunk (i.e. adding header, converting data types)

  # Create a new file name and write to it. You can have your own logic for file names
  write.table(dataChunk, file = paste0("file", index))

  # check if the end of the file has been reached and break out of repeat
  if (nrow(dataChunk) < chunks) {
    break
  }

  # increment the index to read the next chunk
  index <- index + 1
}
Note: The above code is just pseudocode, a partial snippet to help the OP. It will not run and produce results on its own.
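As a rough, self-contained variant of the same fread loop (the file name my_File.txt and output prefix chunk_ are only illustrative): it reads the header once to get the column names and swaps write.table for data.table::fwrite, which is usually much faster for chunks of this size. Be aware that skip still forces fread to scan past all previously read rows on every call, so later chunks can get progressively slower.

library(data.table)

fileName <- "my_File.txt"     # illustrative path to the 14 GB file
chunks   <- 1000000

# read only the header row to get the 70 column names
db_colnames <- names(fread(fileName, nrows = 0, header = TRUE, sep = ";"))

index <- 0
repeat {
  # skip the header plus all rows handled by earlier iterations
  # (assumes the total row count is not an exact multiple of 'chunks')
  dataChunk <- fread(fileName, nrows = chunks, header = FALSE, fill = TRUE,
                     skip = 1 + chunks * index, sep = ";", col.names = db_colnames)

  # fwrite is data.table's fast writer; one output file per chunk
  fwrite(dataChunk, file = paste0("chunk_", index, ".txt"), sep = ";")

  if (nrow(dataChunk) < chunks) break   # last (short) chunk reached
  index <- index + 1
}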
Upvotes: 5
Reputation: 50668
Not an R-based answer, but in this case I recommend a shell-based solution using GNU's split. This should be significantly faster than an R solution.
To split a file into chunks of 10^6 lines each, you'd do:
split -l 1000000 my_file.txt
For details on split, see e.g. here.
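If you prefer to stay inside an R session, a small sketch of invoking the same command through system2 (assumes the split binary is available on the system PATH; the prefix chunk_ is only illustrative):

# run GNU split from R; writes pieces named chunk_aa, chunk_ab, ...
# into the working directory
system2("split", args = c("-l", "1000000", "my_file.txt", "chunk_"))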
Upvotes: 4