Reputation: 15
I have data with more than 1048576 records and want to save this file into Excel or CSV format in R Programming language? I know that excel sheet has restriction of 1048576 records but I am okay if the records can be appended in Other sheet? Is there any way to achieve this? Thanks
Upvotes: 1
Views: 1994
Reputation: 76651
Both scripts to write as csv or xlsx start by setting the digits
option to a bigger value (see this SO question) and to set a temporary directory to save and retrieve the files.
Base function write.csv
doesn't have a 1MB or 1,048,576 rows limit.
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#> size isdir mode mtime
#> so_q71553974_test.csv 97139168 FALSE 666 2022-03-21 08:13:42
#> ctime atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42 no
# read the data from file and check if
# the two data sets are equal
df2 <- read.csv(csv_test_file)
dim(df1)
#> [1] 4194294 2
dim(df2)
#> [1] 4194294 2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE
Created on 2022-03-21 by the reprex package (v2.0.1)
Final clean-up
unlink(csv_test_file)
options(old_opts)
setwd(old_dir)
Excel has a 1MB or 2^20 or 1048576 rows limit so in the code below I will split the data into sub-df's with less than 2^20 - 1 rows. I will subtract 2 to account for the column headers row and an extra row just to not be at the limit.
When tested for equality, the two data.frames have different classes. read_excel
reads the file and outputs a tibble, which sub-classes "data.frame"
.
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
library(readxl)
library(writexl)
xl_test_file <- "so_q71553974_test.xlsx"
max_sheet_size <- 2^20 - 2L # account for header row minus 1 to be safe
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))
sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4
#> 1048574 1048574 1048574 1048572
write_xlsx(sp, path = xl_test_file)
file.info(xl_test_file)
#> size isdir mode mtime
#> so_q71553974_test.xlsx 89724869 FALSE 666 2022-03-21 08:28:54
#> ctime atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54 no
# read the excel file
# since it has more than one sheet, loop through
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))
# bind all rows
df2 <- do.call(rbind, df2)
dim(df1)
#> [1] 4194294 2
dim(df2)
#> [1] 4194294 2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df" "tbl" "data.frame"
# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)
Created on 2022-03-21 by the reprex package (v2.0.1)
Upvotes: 1