Reputation: 25
I have several folders with ca. 10,000 small CSV files each that I'd like to quickly read into memory and stitch together into one data frame per folder. readr's read_csv() can do this conveniently, since it accepts a vector of file paths directly and does the combining for me. However, it crashes when I want to read more than a couple of files.
What is the best way around this issue?
Reproducible example, inspired by the read_csv() documentation:
library(readr)

continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr_example,
  FUN.VALUE = character(1)
)
filepaths_10k <- rep(filepaths, 2000)
# works
read_csv(filepaths, id = "file")
# doesn't
read_csv(filepaths_10k, id = "file")
I get the following error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file '/usr/lib/rstudio/resources/CITATION': Too many open files
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") :
cannot open file '/home/simon/R/x86_64-pc-linux-gnu-library/3.6/readr/extdata/mini-gapminder-asia.csv': Too many open files
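The "Too many open files" part means the R process has hit its per-process limit on open file descriptors. As a rough check (a sketch assuming a Linux system where the shell's ulimit builtin is available), the current soft limit can be printed from R:
# Assumption: Linux with a POSIX shell; prints the soft limit on open file
# descriptors inherited by processes started from this R session.
system("ulimit -n")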
Edit: I have a version of the code using lapply(), read_csv() and rbindlist(), but it did not even finish when I let it run overnight. So speed is part of the story here, and some microbenchmarks I have run suggest that the approach above is much faster.
Edit 2: As per the suggestions (thanks!), I have run some more benchmarks myself. It seems to me the main difference is whether I rely on some "explicit" way of binding the files together or whether readr does that under the hood. Explicitly telling readr not to use lazy evaluation doesn't seem to make a difference in terms of speed, and it also doesn't fix the error, so the suggestion that it could be an OS-specific thing may be correct (I am on Ubuntu 20.04). Also, readr switched the default back to eager evaluation, so this result is expected (I had to check anyway...). Finally, I am not sure I want lazy evaluation in the first place, since I am combining all the files and doing some more cleaning steps anyhow.
library(microbenchmark)
library(data.table) # rbindlist()
library(dplyr)      # bind_rows(), %>%
library(purrr)      # map_df()
microbenchmark(
  l_apply_rbindlist = lapply(filepaths, read_csv) %>% rbindlist(),
  l_apply_bindrows  = lapply(filepaths, read_csv) %>% bind_rows(),
  read_csv_map      = map_df(filepaths, ~ read_csv(.)),
  readr_default     = read_csv(filepaths),
  readr_eager_expl  = read_csv(filepaths, lazy = FALSE),
  times = 10,
  check = "equivalent"
)
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval cld
 l_apply_rbindlist 214.08594 219.90338 223.36077 222.36070 227.47078 232.48656    10   b
  l_apply_bindrows 225.47465 232.00539 235.62815 234.78071 239.32159 249.53793    10   b
      read_csv_map 215.86775 225.37601 229.41726 231.70719 232.17263 240.49416    10   b
     readr_default  57.66125  59.77418  77.79516  60.41160  69.10050 214.88023    10  a
  readr_eager_expl  56.21319  57.05472  61.06905  62.67377  63.66434  64.61471    10  a
Upvotes: 2
Views: 438
Reputation: 11878
You could batch the files across several read_csv() calls so that you stay below your system's open-file limit, but still benefit from the speed of passing multiple files to read in one call.
library(readr)
continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- readr_example(
  paste0("mini-gapminder-", continents, ".csv")
)
BATCH_SIZE <- 3
op <- options(readr.show_col_types = FALSE)

# Assign each file to a batch of at most BATCH_SIZE files, then read each
# batch with a single read_csv() call and row-bind the results.
batch_no <- (seq_along(filepaths) - 1) %/% BATCH_SIZE
split(filepaths, batch_no) |> purrr::map_df(read_csv)
#> # A tibble: 26 × 5
#>    country       year lifeExp      pop gdpPercap
#>    <chr>        <dbl>   <dbl>    <dbl>     <dbl>
#>  1 Algeria       1952    43.1  9279525     2449.
#>  2 Angola        1952    30.0  4232095     3521.
#>  3 Benin         1952    38.2  1738315     1063.
#>  4 Botswana      1952    47.6   442308      851.
#>  5 Burkina Faso  1952    32.0  4469979      543.
#>  6 Burundi       1952    39.0  2445618      339.
#>  7 Argentina     1952    62.5 17876956     5911.
#>  8 Bolivia       1952    40.4  2883315     2677.
#>  9 Brazil        1952    50.9 56602560     2109.
#> 10 Canada        1952    68.8 14785584    11367.
#> # … with 16 more rows
options(op)
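If you still want the file column from the question's id = "file" call, purrr::map_df() forwards extra arguments to read_csv(), so the same batching can carry it through. A minimal sketch, assuming the filepaths and batch_no objects defined above:
# Sketch: extra arguments to map_df() are passed on to read_csv(), so each
# batch still records which file every row came from via id = "file".
split(filepaths, batch_no) |>
  purrr::map_df(read_csv, id = "file")
The result then matches the single-call read_csv(filepaths, id = "file") from the question, just built batch by batch.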
Upvotes: 2