CT_369

Reputation: 89

Writing a huge R dataframe into 4 separate files in R

I have a tibble/dataframe in R with about 206 million records and 5 columns. My system runs out of memory if I do any further analysis or computation on this data. Hence, I want to write this tibble to disk as 4 separate csv files of ~50 million records each (the last one would be ~56 million) and proceed with further computation/analysis in 4 separate iterations. I searched a few threads on the web but could not find any that suit this use case.

How can I achieve this?

Upvotes: 0

Views: 129

Answers (3)

Mikael Jagan

Reputation: 11326

There are two base R approaches that you can try, if for some reason data.table is not an option. The first approach is to write the full data frame (here called df) to disk, free the memory occupied by df, then read and write the subsets one at a time using readLines and writeLines.

nms <- names(df)
n <- nrow(df)

write.table(df, file = "df.csv", sep = ",", row.names = FALSE, col.names = FALSE)
rm(df); gc(FALSE)

start <- seq(from = 1L, to = n, by = 5e+07L)
end <- c(start[-1L] - 1L, n)

rcon <- file("df.csv", "r")
for (i in seq_along(start)) {
    wcon <- file(sprintf("df_%d_%d.csv", start[i], end[i]), "w")
    cat(nms, file = wcon, sep = ",")
    cat("\n", file = wcon)
    writeLines(readLines(rcon, n = end[i] - start[i] + 1L), wcon)
    close(wcon)
}
close(rcon)
unlink("df.csv")

The second (probably much worse) approach is to write the variables in df to disk individually, free the memory occupied by df, then read the variables into memory one at a time as needed to construct and write the subsets.

nms <- names(df)
n <- nrow(df)

for (nm in nms) {
    saveRDS(df[[nm]], file = paste0(nm, ".rds"))
}
rm(df); gc(FALSE)

start <- seq(from = 1L, to = n, by = 5e+07L)
end <- c(start[-1L] - 1L, n)

for (i in seq_along(start)) {
    l <- setNames(vector("list", length(nms)), nms)
    for (nm in nms) {
        l[[nm]] <- readRDS(paste0(nm, ".rds"))[start[i]:end[i]]
    }
    write.csv(l, file = sprintf("df_%d_%d.csv", start[i], end[i]), row.names = FALSE)
}
unlink(paste0(nms, ".rds"))

Ideally, you are running the above in a new R process with only df in the global environment.
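Once the four chunk files exist, the per-chunk analysis the question describes can be a simple loop that keeps at most one chunk in memory at a time. A minimal sketch, assuming the files follow the df_<start>_<end>.csv naming used above; the two tiny dummy files and the running sum are placeholders standing in for the real data and the real computation:

```r
# dummy chunk files standing in for the real df_<start>_<end>.csv outputs
write.csv(data.frame(x = 1:3), "df_1_3.csv", row.names = FALSE)
write.csv(data.frame(x = 4:6), "df_4_6.csv", row.names = FALSE)

total <- 0
for (f in list.files(pattern = "^df_[0-9]+_[0-9]+\\.csv$")) {
    chunk <- read.csv(f)           # at most one chunk in memory at a time
    total <- total + sum(chunk$x)  # stand-in for the real per-chunk analysis
    rm(chunk)
    gc(FALSE)                      # release the chunk before reading the next
}
```

Each iteration frees its chunk before the next is read, so peak memory is roughly one chunk plus whatever the analysis itself accumulates.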

Upvotes: 0

alejandro_hagan

Reputation: 1003

Apologies if this solution misses the mark, but I believe the below should work:

library(dplyr)
library(readr)

df %>%                    # name of dataframe
  slice(1:5e7) %>%        # first 50M rows
  write_csv("file_a.csv") # save as csv

then repeat for the remaining sets, changing only the row range in slice() and the file name in write_csv().
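The repetition can also be folded into a loop so the boundaries are computed rather than typed by hand. A sketch under the question's numbers (4 files, 50M rows each, last file taking the remainder), scaled down here to a small dummy tibble so it runs; for the real data set chunk_size to 5e7 and drop the dummy df:

```r
library(dplyr)
library(readr)

# dummy data standing in for the real 206M-row tibble;
# for the real data use chunk_size <- 5e7 with the existing df
df <- tibble(id = 1:45, value = rnorm(45))
chunk_size <- 10
n_files <- 4

starts <- seq(1, by = chunk_size, length.out = n_files)
for (i in seq_len(n_files)) {
    from <- starts[i]
    # last file takes all remaining rows (~56M of the 206M in the real case)
    to <- if (i == n_files) nrow(df) else from + chunk_size - 1
    df %>%
        slice(from:to) %>%
        write_csv(sprintf("file_%d.csv", i))
}
```

With 45 rows and a chunk size of 10 this writes file_1.csv through file_4.csv of 10, 10, 10, and 15 rows respectively, mirroring the 50/50/50/56M split requested.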

Upvotes: -1

Sweepy Dodo

Reputation: 1873

Let us know if your machine has the memory for the below. This achieves OP's goal (request) of splitting the original df into 4 batches and saving each as a separate file.

library(data.table)

# dummy data; if df already exists as a data.frame, use setDT(df) instead
df <- data.table(row_id = 1:123)

# parameters
x <- nrow(df)  # no. of rows
y <- 4         # no. of splits

# create batch number (ceiling so every row gets a batch in 1:y
# even when x is not a multiple of y)
df[, batch := rep(1:y, each = ceiling(x / y), length.out = x)]

# split into a list of one data.table per batch
df <- split(df, by = 'batch')

# save each batch as a separate csv named after its batch number
lapply(df, \(i) fwrite(i, file = paste0(i$batch[1], '.csv')))

Upvotes: 1
