Reputation: 89
I have a tibble/data frame in R with about 206 million records and 5 columns. My system runs out of memory if I do any further analysis/computation on this data. Hence, I want to write this tibble to disk as 4 separate CSV files of ~50 million records each (the last one would have ~56 million) and proceed with the further computation/analysis in 4 separate iterations. I searched a few threads on the web but could not find any that suited this use case.
How can I achieve this?
Upvotes: 0
Views: 129
Reputation: 11326
There are two base R approaches that you can try if for some reason data.table is not an option. The first approach is to write the full data frame (here called df) to disk, free the memory occupied by df, then read and write the subsets one at a time using readLines and writeLines.
nms <- names(df)
n <- nrow(df)
## write without a header so that every line of df.csv is a record
write.table(df, file = "df.csv", sep = ",", row.names = FALSE, col.names = FALSE)
rm(df); gc(FALSE)
## first and last row of each ~50 million record subset
start <- seq(from = 1L, to = n, by = 5e+07L)
end <- c(start[-1L] - 1L, n)
rcon <- file("df.csv", "r")
for (i in seq_along(start)) {
  wcon <- file(sprintf("df_%d_%d.csv", start[i], end[i]), "w")
  cat(nms, file = wcon, sep = ",")  # header line for this subset
  cat("\n", file = wcon)
  writeLines(readLines(rcon, n = end[i] - start[i] + 1L), wcon)
  close(wcon)
}
close(rcon)
unlink("df.csv")
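Since the stated goal is to run the analysis in separate iterations, the chunk files produced above can then be processed one at a time. A minimal base R sketch, where analyze() is a hypothetical placeholder for your actual computation:

```r
## Process the chunk files one at a time so that only one ~50M row
## subset is ever in memory; analyze() is a hypothetical placeholder.
analyze <- function(chunk) summary(chunk)  # replace with real analysis

for (f in list.files(pattern = "^df_[0-9]+_[0-9]+\\.csv$")) {
  chunk <- read.csv(f)
  saveRDS(analyze(chunk), sub("\\.csv$", "_result.rds", f))
  rm(chunk); gc(FALSE)  # release the chunk before the next iteration
}
```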
The second (probably much worse) approach is to write the variables in df to disk individually, free the memory occupied by df, then read the variables into memory one at a time as needed to construct and write the subsets.
nms <- names(df)
n <- nrow(df)
## save each column separately
for (nm in nms) {
  saveRDS(df[[nm]], file = paste0(nm, ".rds"))
}
rm(df); gc(FALSE)
start <- seq(from = 1L, to = n, by = 5e+07L)
end <- c(start[-1L] - 1L, n)
for (i in seq_along(start)) {
  l <- setNames(vector("list", length(nms)), nms)
  ## read one column at a time, keeping only the rows in this subset
  for (nm in nms) {
    l[[nm]] <- readRDS(paste0(nm, ".rds"))[start[i]:end[i]]
  }
  write.csv(l, file = sprintf("df_%d_%d.csv", start[i], end[i]), row.names = FALSE)
}
unlink(paste0(nms, ".rds"))
Ideally, you are running the above in a new R process with only df in the global environment.
Upvotes: 0
Reputation: 1003
Apologies if this solution misses the mark, but I believe the below should work:
library(dplyr)
library(readr)

df %>%                     # name of data frame
  slice(1:5e7) %>%         # first 50M rows
  write_csv("file_a.csv")  # save as csv
and repeat for the remaining sets, changing only the row range passed to slice() and the file name passed to write_csv().
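Rather than repeating the pipeline by hand, the same idea can be wrapped in a loop. A minimal sketch under the same dplyr/readr assumptions, with hypothetical output names file_1.csv through file_4.csv:

```r
library(dplyr)
library(readr)

chunk_size <- 5e7  # 50M rows per file
n_chunks <- 4      # last file absorbs the remainder (~56M rows)

for (i in seq_len(n_chunks)) {
  from <- (i - 1) * chunk_size + 1
  to <- if (i == n_chunks) nrow(df) else i * chunk_size
  df %>%
    slice(from:to) %>%
    write_csv(sprintf("file_%d.csv", i))
}
```

Note this still requires df (and hence the full data) in memory while slicing, which may already be too much for the OP's machine.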
Upvotes: -1
Reputation: 1873
Let us know if your machine has the memory for the below. This achieves OP's goal of splitting the original df into 4 parts and saving them as separate files.
library(data.table)
setDT(df)  # convert OP's tibble by reference (skip if using the dummy data below)

# dummy data
df <- data.table(row_id = 1:123)

# parameters
x <- nrow(df)  # no. of rows in df
y <- 4         # no. of splits

# create batch number; ceiling() keeps every batch a contiguous block of rows
df[, batch := rep(1:y, each = ceiling(x/y), length.out = x)]

# split into a list of y data.tables
df <- split(df, by = 'batch')

# save each batch as a separate csv named after its batch number
lapply(df, \(i) fwrite(i, file = paste0(i$batch[1], '.csv')))
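One caveat: split() materializes a second copy of the batches, roughly doubling peak memory. A lower-memory sketch under the same setup (df with the batch column, y splits) writes each batch straight from df instead:

```r
library(data.table)

# write each batch directly, avoiding the extra copy created by split()
for (b in 1:y) {
  fwrite(df[batch == b], file = paste0(b, '.csv'))
}
df[, batch := NULL]  # drop the helper column when done
```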
Upvotes: 1