Reputation: 73
I am trying to import a large amount of data into R, so much that R crashes after about 15 minutes of importing.
I therefore need to split the import into intervals. Below is how I have done it for one interval, files 101-200, storing the result in a list called ALL200.
However, I'm not sure how to automate this, since I would need to move the interval to the next 100 files each time.
library(readr)

ALL200 <- list() # creates a list
listcsv <- dir(pattern = "\\.csv$") # all csv files in the directory (pattern is a regular expression, not a glob)
# make a list in R with all the stocks
for (k in 101:200) {
  ALL200[[k]] <- read_csv(listcsv[k],
    col_types = cols(expirDate = col_date(format = "%Y-%m-%d"),
                     trade_date = col_date(format = "%Y-%m-%d")))
}
Hope you can help me out.
Upvotes: 0
Views: 90
Reputation: 8516
I would give data.table a try:
library(data.table)
library(fasttime)
## generate mock files
set.seed(1)
bigdt <- data.table(expirDate = paste(sample(1980:2020, 1e6, replace = TRUE),
                                      sample(1:12, 1e6, replace = TRUE),
                                      sample(1:28, 1e6, replace = TRUE),
                                      sep = "-"),
                    trade_date = paste(sample(1980:2020, 1e6, replace = TRUE),
                                       sample(1:12, 1e6, replace = TRUE),
                                       sample(1:28, 1e6, replace = TRUE),
                                       sep = "-"))
biglist <- split(bigdt, ceiling(seq_len(dim(bigdt)[1])/1e3))
invisible(lapply(seq_along(biglist),
                 function(x) fwrite(biglist[[x]],
                                    file = paste0("datefile_", sprintf("%04d", x), ".csv"))))
## read files in chunks of 100
system.time({ ## for timing
  listcsv <- dir(pattern = "date.*csv")
  listcsv <- split(listcsv, ceiling(seq_along(listcsv)/100))
  importFiles <- function(x){
    dt <- setNames(lapply(listcsv[[x]], fread), listcsv[[x]])
    dt <- rbindlist(dt, idcol = "File")
    dt[, c("expirDate", "trade_date") := lapply(.SD, fastPOSIXct, "GMT"), .SDcols=c("expirDate", "trade_date")][]
    # maybe do additional filtering, removal of columns, etc.
  }
  bigdt <- rbindlist(lapply(seq_along(listcsv), importFiles))
})
#>    user  system elapsed
#>   0.572   0.033   0.607
bigdt
#>                       File  expirDate trade_date
#>       1: datefile_0001.csv 1983-12-03 2002-07-25
#>       2: datefile_0001.csv 2018-03-24 1998-07-09
#>       3: datefile_0001.csv 1980-08-21 1985-11-05
#>       4: datefile_0001.csv 2013-10-20 2011-11-03
#>       5: datefile_0001.csv 2002-10-15 1996-05-25
#>      ---
#>  999996: datefile_1000.csv 1998-03-05 1986-11-08
#>  999997: datefile_1000.csv 1984-01-13 2004-05-21
#>  999998: datefile_1000.csv 1991-12-20 1989-09-14
#>  999999: datefile_1000.csv 2005-03-24 2015-06-04
#> 1000000: datefile_1000.csv 2007-04-22 1996-07-06
Created on 2020-04-23 by the reprex package (v0.3.0)
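The chunking step above relies on the expression split(x, ceiling(seq_along(x)/n)). As a minimal sketch of what that expression produces, here it is on a handful of made-up file names (the names and the chunk size of 3 are illustrative assumptions, not taken from the question):

```r
# Hypothetical file names, for illustration only
files <- sprintf("file_%02d.csv", 1:7)

# Position i is assigned to group ceiling(i / 3): 1, 1, 1, 2, 2, 2, 3
chunks <- split(files, ceiling(seq_along(files) / 3))

lengths(chunks) # number of files in each chunk
str(chunks)     # a named list with one character vector per chunk
```

Because the last chunk simply holds whatever files are left over, the file count does not need to be a multiple of the chunk size.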
Upvotes: 0
Reputation: 16998
I'm not sure if this solves your question, but try using a list of lists:
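library(readr)

A <- list()
listcsv <- dir(pattern = "\\.csv$") # pattern is a regular expression, not a glob
n_chunks <- ceiling(length(listcsv) / 100) # number of 100-file intervals
for (i in seq_len(n_chunks)) {
  B <- list()
  first <- (i - 1) * 100 + 1            # 1, 101, 201, ...
  last <- min(i * 100, length(listcsv)) # do not run past the last file
  for (k in first:last) {
    B[[k - first + 1]] <- read_csv(listcsv[k],
      col_types = cols(expirDate = col_date(format = "%Y-%m-%d"),
                       trade_date = col_date(format = "%Y-%m-%d")))
  }
  A[[i]] <- B
}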
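The same interval logic could also be wrapped in a function so that each interval is combined into a single data frame as soon as it is read. The following is only a sketch under stated assumptions: read_in_chunks is a hypothetical helper, not part of any package, and it defaults to base read.csv so that it runs without extra packages. A function that calls read_csv with the col_types from the question could be passed as reader instead. The demo files are temporary mock data.

```r
# Hypothetical helper: read `files` in intervals of `chunk_size` and
# return a list with one combined data frame per interval.
read_in_chunks <- function(files, chunk_size = 100, reader = read.csv) {
  chunks <- split(files, ceiling(seq_along(files) / chunk_size))
  lapply(chunks, function(chunk) {
    # read every file of this interval, then stack the rows
    do.call(rbind, lapply(chunk, reader))
  })
}

# Small self-contained demo using temporary mock files
tmp <- tempfile("chunkdemo_")
dir.create(tmp)
for (i in 1:5) {
  write.csv(data.frame(id = i, trade_date = "2020-04-23"),
            file.path(tmp, sprintf("stock_%02d.csv", i)),
            row.names = FALSE)
}
demo_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

result <- read_in_chunks(demo_files, chunk_size = 2)
length(result)       # number of intervals
sapply(result, nrow) # rows per interval
```

Whether this avoids the crash depends on why R breaks down. If memory is the limit, each interval could be filtered or written to disk inside the function instead of being kept in the returned list.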
Upvotes: 0