Reputation: 1875
I have around 400'000 objects that have an id and a specific status. Every day (365 days a year) a status file is generated as follows:
Example of a status file:
n <- 400000
# draw n unique 12-digit ids; sample(1e11, n) avoids materialising the full
# 200000000000:299999999999 sequence, which would not fit in memory
df <- data.frame(id = sample(1e11, size = n) + 2e11 - 1,
                 status = sample(seq(100, 900, 10), size = n, replace = TRUE))
time <- paste0(Sys.Date(), "_", format(Sys.time(), "%H%M"))
writexl::write_xlsx(df, paste0("status_", time, ".xlsx"))
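Each run therefore writes a single file named in the form status_YYYY-MM-DD_HHMM.xlsx, for example status_2024-01-31_1436.xlsx.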
Now, my question is: how can I most efficiently put these files together so that I can analyse the development of certain statuses over time? To save some memory I didn't create a date variable within each status file, as it would be the same for all rows (the date is only in the filename).
One status file is around 6 or 7 MB, so over a year this adds up to roughly 2.5 GB.
Upvotes: 0
Views: 65
Reputation: 139
The easiest way would be to list the files, then read them in, and then join them.
So you could do something like this:
library(dplyr)  # needed for the %>% pipe and mutate() below

dir <- "path to directory"
files <- list.files(dir, full.names = TRUE)
# can change based on file type (if you convert to csv, you can use data.table::fread)
data <- purrr::map(files, readxl::read_excel)
# you'll probably want the date of each file, taken from its name
dates <- files %>% basename() %>% stringr::str_remove("\\.xlsx$") %>% stringr::str_remove("^status_")
# attach that date string as a column to the matching data frame
data <- purrr::map2(data, dates, ~ mutate(.x, dates = .y))
data_join <- purrr::reduce(data, rbind)
# you might want to convert the dates into date-times, here's an example
data_join <- data_join %>%
  tidyr::separate(dates, into = c("date", "time"), sep = "_") %>%
  mutate(date = lubridate::ymd(date),
         date_time = as.POSIXct(paste(date, time), format = "%Y-%m-%d %H%M"))
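With data_join in place you can summarise the development of particular statuses over time directly. A minimal sketch using the columns created above (the chosen status values are just examples):
# count how many objects hold selected statuses on each day
status_trend <- data_join %>%
  filter(status %in% c(100, 200)) %>%
  count(date, status)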
Reading the Excel files this way might not be the fastest solution in R, but it is one of the quickest ways to get up and running from the Excel files.
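As the comment in the code hints, if the daily files were written as CSV instead of xlsx (e.g. with data.table::fwrite or readr::write_csv), the reading and stacking step can be done with data.table. A rough sketch, assuming the same status_YYYY-MM-DD_HHMM naming scheme:
library(data.table)

csv_files <- list.files(dir, pattern = "^status_.*\\.csv$", full.names = TRUE)
# read every daily file and stack them; idcol keeps the source filename
dt <- rbindlist(setNames(lapply(csv_files, fread), basename(csv_files)),
                idcol = "file")
# recover the date from the filename, as before
dt[, date := as.Date(sub("^status_(\\d{4}-\\d{2}-\\d{2})_.*$", "\\1", file))]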
Upvotes: 1