r - rollapply across a multi-file database

Question

I have a large database that I've split into multiple files. Each file is saved in the same directory, and the file names contain a numerical sequence so the order of the database is maintained. I've done this to reduce the time and memory it takes to load and manipulate the database. I would like to start analyzing the database in sequence, which I intend to accomplish using a rollapply-like function. My problem arises when the window needs to span two files at once, and that is where I need help. Here is a dummy dataset that creates five CSV files with a naming scheme similar to my database's:

library(readr)

val <- c(1,2,3,4,5)
df_1 <- data.frame(val)

write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)

Keep in mind that this database is huge and causes memory and time issues on my current machine. The solution MUST have a component that "forgets". This means that recurrently joining the files, or loading them all at once into the R environment, is not an option. When a new file is loaded, the oldest loaded file must be removed from the R environment. I can have at most three files loaded at once. For example, files 1-3 can be loaded, and then file 1 needs to be removed before file 4 is loaded.

The output can be a single list of all files - the combination of files 1-5 in a single list.

For the sake of simplicity, let's say I want to use a window of 2, and I want to calculate the mean of this window. I'm imagining something like this (see below), but this may be a failed approach, and I'm open to anything.

appreciated_function <- function(x){

           # Your greatly appreciated function
}

rollapply(df, 2, appreciated_function, by.column = FALSE, align = "left")

G. Grothendieck · Accepted Answer

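Suppose the window width is k. Iterate through the files and, for each one, read that file plus the first k-1 rows of the next file (the last file has no successor, so it is read on its own). Apply rollapply to the combined data and append the result to what we have so far. This way only one full file and k-1 rows of the next are loaded at any time. Alternatively, if the output is too large, we could write out each result instead of appending it.

The check at the bottom confirms that this gives the same result as running rollapply over the whole series at once. The example uses sum as the window function; substitute mean to get the rolling mean asked for in the question.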

library(readr)
library(zoo)

val <- c(1,2,3,4,5)
df_1 <- data.frame(val)

write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)

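# list the files and sort them by their leading number, since dir() sorts
# alphabetically and would put "10_database.csv" before "2_database.csv"
d <- dir(pattern = "database.csv$")
d <- d[order(as.numeric(sub("_.*", "", d)))]

k <- 2     # window width
r <- NULL  # accumulated result
for(i in seq_along(d)) {
   # first k-1 rows of the next file; NULL for the last file
   Next <- if (i != length(d)) read_csv(d[i+1], n_max = k-1)
   # current file plus the borrowed rows, so windows can cross the file boundary
   DF <- rbind(read_csv(d[i]), Next)
   r0 <- rollapply(DF, k, sum, align = "left")
   # if output too large replace next statement with one to write out r0
   r <- rbind(r, r0)
}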

# check
r2 <- rollapply(data.frame(val = sequence(rep(5, 5))), k, sum, align = "left")
identical(r, r2)
## [1] TRUE
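
The question asks for a rolling mean, and the comment in the loop notes that each r0 can be written out instead of accumulated. Here is a minimal sketch of that variant, assuming each file has a single numeric column named val as in the dummy data. The helper name roll_files_mean and the output file rolling_mean.csv are hypothetical, and base read.csv/write.table stand in for readr to keep the dependencies down.

```r
library(zoo)

# Hypothetical helper (not from the original answer): rolling mean of width k
# across numerically ordered CSV files, writing each file's result to `out`
# instead of accumulating it in memory. Assumes a numeric column named `val`.
roll_files_mean <- function(files, k, out) {
  # sort by the leading number so "10_..." comes after "2_..."
  files <- files[order(as.numeric(sub("_.*", "", basename(files))))]
  if (file.exists(out)) file.remove(out)
  for (i in seq_along(files)) {
    cur <- read.csv(files[i])
    # first k-1 rows of the next file; NULL for the last file
    nxt <- if (i < length(files) && k > 1) read.csv(files[i + 1], nrows = k - 1)
    r0 <- rollapply(c(cur$val, nxt$val), k, mean, align = "left")
    # header only on the first write, then append
    write.table(data.frame(val = r0), out, sep = ",", row.names = FALSE,
                col.names = (i == 1), append = (i > 1))
  }
  invisible(out)
}

# demo on the dummy data, written to a temporary directory
tmp <- tempfile()
dir.create(tmp)
for (i in 1:5) {
  write.csv(data.frame(val = 1:5), file.path(tmp, paste0(i, "_database.csv")),
            row.names = FALSE)
}
files <- dir(tmp, pattern = "_database\\.csv$", full.names = TRUE)
out <- roll_files_mean(files, k = 2, out = file.path(tmp, "rolling_mean.csv"))
head(read.csv(out)$val)
```

With five files of five rows each and k = 2, the output file holds 25 - 2 + 1 = 24 values. Since cur and nxt are overwritten on every iteration, the inputs held in memory never exceed one file plus k-1 rows.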
