Reputation: 25
I have several folders with ca. 10,000 small CSV files each that I'd like to quickly read into memory and stitch together into one data frame per folder. readr's read_csv() can do this conveniently, since it accepts a vector of file paths directly and does the combining for me. However, it crashes when I want to read more than a couple of files.
What is the best way around this issue?
Reproducible example, inspired by the read_csv() documentation:
library(readr)

continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr_example,
  FUN.VALUE = character(1)
)
filepaths_10k <- rep(filepaths, 2000)
# works
read_csv(filepaths, id = "file")
# doesn't
read_csv(filepaths_10k, id = "file")
I get the following error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file '/usr/lib/rstudio/resources/CITATION': Too many open files
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") :
cannot open file '/home/simon/R/x86_64-pc-linux-gnu-library/3.6/readr/extdata/mini-gapminder-asia.csv': Too many open files
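The "Too many open files" part means the R process has hit its per-process limit on open file descriptors. As a rough check (a sketch assuming a Linux system where the shell's ulimit builtin is available), the current soft limit can be printed from R:
# Assumption: Linux with a POSIX shell; prints the soft limit on open file
# descriptors inherited by processes started from this R session.
system("ulimit -n")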
Edit: I have a version of the code using lapply(), read_csv() and rbindlist(), but it did not even finish when I let it run overnight. So speed is part of the story here, and some microbenchmarks I have run suggest that the approach above is much faster.
Edit 2: As per the suggestions (thanks!), I have run some more benchmarks myself. It seems to me the main difference is whether I rely on some "explicit" way of binding the files together or whether readr does that under the hood. Explicitly telling readr not to use lazy evaluation doesn't seem to make a difference in terms of speed, and it also doesn't fix the error, so the suggestion that it could be an OS-specific thing may be correct (I am on Ubuntu 20.04). Also, readr switched the default back to eager evaluation, so this result is expected (I had to check anyway...). Finally, I am not sure I want lazy evaluation in the first place, since I am combining all the files and doing some more cleaning steps anyhow.
library(microbenchmark)
library(data.table) # rbindlist()
library(dplyr)      # bind_rows(), %>%
library(purrr)      # map_df()
microbenchmark(
  l_apply_rbindlist = lapply(filepaths, read_csv) %>% rbindlist(),
  l_apply_bindrows  = lapply(filepaths, read_csv) %>% bind_rows(),
  read_csv_map      = map_df(filepaths, ~ read_csv(.)),
  readr_default     = read_csv(filepaths),
  readr_eager_expl  = read_csv(filepaths, lazy = FALSE),
  times = 10,
  check = "equivalent"
)
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval cld
 l_apply_rbindlist 214.08594 219.90338 223.36077 222.36070 227.47078 232.48656    10   b
  l_apply_bindrows 225.47465 232.00539 235.62815 234.78071 239.32159 249.53793    10   b
      read_csv_map 215.86775 225.37601 229.41726 231.70719 232.17263 240.49416    10   b
     readr_default  57.66125  59.77418  77.79516  60.41160  69.10050 214.88023    10  a
  readr_eager_expl  56.21319  57.05472  61.06905  62.67377  63.66434  64.61471    10  a
Upvotes: 2
Views: 438
Reputation: 11878
You could batch the files across several read_csv() calls so that you stay below your system's open-file limit, but still benefit from the speed of passing multiple files to read in one call.
library(readr)
continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- readr_example(
  paste0("mini-gapminder-", continents, ".csv")
)
BATCH_SIZE <- 3
op <- options(readr.show_col_types = FALSE)

# Assign each file to a batch of at most BATCH_SIZE files, then read each
# batch with a single read_csv() call and row-bind the results.
batch_no <- (seq_along(filepaths) - 1) %/% BATCH_SIZE
split(filepaths, batch_no) |> purrr::map_df(read_csv)
#> # A tibble: 26 × 5
#>    country       year lifeExp      pop gdpPercap
#>    <chr>        <dbl>   <dbl>    <dbl>     <dbl>
#>  1 Algeria       1952    43.1  9279525     2449.
#>  2 Angola        1952    30.0  4232095     3521.
#>  3 Benin         1952    38.2  1738315     1063.
#>  4 Botswana      1952    47.6   442308      851.
#>  5 Burkina Faso  1952    32.0  4469979      543.
#>  6 Burundi       1952    39.0  2445618      339.
#>  7 Argentina     1952    62.5 17876956     5911.
#>  8 Bolivia       1952    40.4  2883315     2677.
#>  9 Brazil        1952    50.9 56602560     2109.
#> 10 Canada        1952    68.8 14785584    11367.
#> # … with 16 more rows
options(op)
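If you still want the file column from the question's id = "file" call, purrr::map_df() forwards extra arguments to read_csv(), so the same batching can carry it through. A minimal sketch, assuming the filepaths and batch_no objects defined above:
# Sketch: extra arguments to map_df() are passed on to read_csv(), so each
# batch still records which file every row came from via id = "file".
split(filepaths, batch_no) |>
  purrr::map_df(read_csv, id = "file")
The result then matches the single-call read_csv(filepaths, id = "file") from the question, just built batch by batch.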
Upvotes: 2