D. Studer

Reputation: 1875

R: efficiently reading in files and joining them

I have around 400'000 objects that have an id and a specific status. Every day (365 days a year) a status file is generated as follows:

Example of a status file:

n = 400000
df <- data.frame(id = sample(200000000000:299999999999, size = n, replace = FALSE),
                 status = sample( seq(100, 900, 10), size = n, replace = TRUE))

time <- paste0(Sys.Date(), "_", format(Sys.time(), "%H%M"))

writexl::write_xlsx(df, paste0("status_", time, ".xlsx"))

Now, my question is: how can I most efficiently combine these files so that I can analyse how certain statuses develop over time? To save some memory I didn't create a date variable within each status file, as it would be the same for all rows (the date is already in the filename).

One status file is around 6 or 7 MB, so over a year this would sum up to around 2.5 GB.

Upvotes: 0

Views: 65

Answers (1)

Robert Schauner

Reputation: 139

The easiest way would be to list the files, then read them in, and then join them.

So you could do something like this:

library(dplyr)

dir <- "path to directory"
files <- list.files(dir, pattern = "\\.xlsx$", full.names = TRUE)

# can change based on file type (if you convert to csv, you can use data.table::fread)
data <- purrr::map(files, readxl::read_excel)

# you'll probably want the date of each file, taken from its name
dates <- files %>%
  basename() %>%
  stringr::str_remove("\\.xlsx$") %>%
  stringr::str_remove("^status_")

# mutate() recycles the single date string to every row of that file
data <- purrr::map2(data, dates, ~ mutate(.x, dates = .y))

# bind_rows() stacks a list of data frames in one go (much faster than reduce(rbind))
data_join <- bind_rows(data)

# you might want to convert the dates into date-times, here's an example
# (the time part has no separator, e.g. "1530", so parse it with an explicit format)

data_join <- data_join %>%
  tidyr::separate(dates, into = c("date", "time"), sep = "_") %>%
  dplyr::mutate(date = lubridate::ymd(date),
                date_time = as.POSIXct(paste(date, time), format = "%Y-%m-%d %H%M"))

This might not be the fastest solution in R, but it is one of the quickest ways to get up and running from the Excel files.
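If you do convert the daily files to CSV, as the comment above suggests, `data.table::fread()` plus `rbindlist()` will read and stack them considerably faster than the xlsx route. A minimal sketch, assuming files named like `status_2024-01-15_1530.csv` (it writes one toy file to a temp directory just to be self-contained):

```r
library(data.table)

# create a toy CSV in a temp directory to stand in for one daily status file
dir <- tempdir()
fwrite(data.table(id = 1:3, status = c(100, 200, 300)),
       file.path(dir, "status_2024-01-15_1530.csv"))

files <- list.files(dir, pattern = "^status_.*\\.csv$", full.names = TRUE)

# fread() each file, then rbindlist() stacks the list in one efficient step;
# idcol = "file" records which file each row came from (as an integer index)
data_join <- rbindlist(lapply(files, fread), idcol = "file")

# recover the date-time from the file names via the file index
data_join[, dates := gsub("^status_|\\.csv$", "", basename(files))[file]]
data_join[, date_time := as.POSIXct(dates, format = "%Y-%m-%d_%H%M")]
```

With a year of files this keeps only one pass over the disk and one allocation for the combined table, instead of growing the result with repeated `rbind()` calls.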

Upvotes: 1
