Atul Singh
Atul Singh

Reputation: 15

How to save Excel/CSV file with more than 1048576 records in R?

I have data with more than 1048576 records and want to save this file into Excel or CSV format in R Programming language? I know that excel sheet has restriction of 1048576 records but I am okay if the records can be appended in Other sheet? Is there any way to achieve this? Thanks

Upvotes: 1

Views: 1994

Answers (1)

Rui Barradas
Rui Barradas

Reputation: 76651

Both scripts to write as csv or xlsx start by setting the digits option to a bigger value (see this SO question) and to set a temporary directory to save and retrieve the files.

Write as CSV

Base function write.csv doesn't have a 1MB or 1,048,576 rows limit.

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294

csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#>                           size isdir mode               mtime
#> so_q71553974_test.csv 97139168 FALSE  666 2022-03-21 08:13:42
#>                                     ctime               atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42  no

# read the data from file and check if 
# the two data sets are equal
df2 <- read.csv(csv_test_file)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE

Created on 2022-03-21 by the reprex package (v2.0.1)

Final clean-up

unlink(csv_test_file)
options(old_opts)
setwd(old_dir)

Write as Excel file

Excel has a 1MB or 2^20 or 1048576 rows limit so in the code below I will split the data into sub-df's with less than 2^20 - 1 rows. I will subtract 2 to account for the column headers row and an extra row just to not be at the limit.
When tested for equality, the two data.frames have different classes. read_excel reads the file and outputs a tibble, which sub-classes "data.frame".

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294


library(readxl)
library(writexl)

xl_test_file <- "so_q71553974_test.xlsx"

max_sheet_size <- 2^20 - 2L  # account for header row minus 1 to be safe
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))

sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4 
#>      1048574      1048574      1048574      1048572
write_xlsx(sp, path = xl_test_file)

file.info(xl_test_file)
#>                            size isdir mode               mtime
#> so_q71553974_test.xlsx 89724869 FALSE  666 2022-03-21 08:28:54
#>                                      ctime               atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54  no

# read the excel file
# since it has more than one sheet, loop through 
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))

# bind all rows 
df2 <- do.call(rbind, df2)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df"     "tbl"        "data.frame"

# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)

Created on 2022-03-21 by the reprex package (v2.0.1)

Upvotes: 1

Related Questions