Reputation: 161
This question is a more fundamental follow-up to a previous post of mine.
I would like to understand how to most efficiently wrangle data in a scenario where I have many large (25-80GB) .csv datasets with the same column names (following some units over time) and I am working in R. I have used a combination of {arrow} and {duckdb} before (see older post) but am open to other solutions, e.g. using {duckdb} alone. Resources to learn more about a suggested approach would also be very welcome.
Operations required will include filtering the data based on the occurrence of a certain string in a given column (e.g., the string "Lewis" in a "book_author" column), collapsing (e.g., individual X booked 3 Lewis books in year Y), and rbind()-ing the multiple, now cleaned, datasets at the end.
Once the data cleaning is done, the resulting file should be manageable in data.table.
PS: Reading some of the answers, I realise I should have mentioned the following. I am working on a remote machine where I cannot install anything, but where I have (I think) all CRAN packages and the equivalent for Python. So solutions that require installation of other software are not viable for me, but they are still super interesting and useful to the broader community.
Upvotes: 3
Views: 240
Reputation: 116957
The following notes focus on getting to the point where you could use DuckDB (or something very much like DuckDB, such as Polars) either as a midpoint (e.g. if you want Parquet files) or as an endpoint.
If your CSV is sufficiently regular, you could probably get away with using DuckDB's CSV importing capabilities directly. I've had good experience with:
COPY (FROM '4GB.csv')
TO '/tmp/4GB.snappy.parquet'
(FORMAT PARQUET, CODEC 'SNAPPY', ROW_GROUP_SIZE 100000);
(You can easily add filtering conditions in the FROM clause.)
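Since the question mentions working from R with {duckdb}, here is a minimal sketch of issuing that same conversion through DBI, with a WHERE filter added; the file name big1.csv and the column book_author are placeholders rather than names taken from the actual data:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# Convert one large CSV to Parquet, keeping only the rows of interest.
# 'big1.csv' and 'book_author' are placeholder names.
dbExecute(con, "
  COPY (
    FROM 'big1.csv'
    WHERE book_author LIKE '%Lewis%'
  )
  TO 'big1.snappy.parquet'
  (FORMAT PARQUET, CODEC 'SNAPPY', ROW_GROUP_SIZE 100000);
")

dbDisconnect(con, shutdown = TRUE)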
In case DuckDB is problematic, here are some memory-efficient options for preparing the data with a view to using DuckDB or Polars.
By "memory-efficient" here I essentially mean "not requiring significantly more memory than is needed to store the longest line", i.e. stream-processing. There are two classes of such tools:
a) CSV-aware readers (e.g. PHP's fgetcsv, Python's csv.reader, and others listed below)
b) other memory-efficient processors, such as: awk, jq
The processors in (a) and (b) could of course be used together advantageously, e.g. by tasking PHP with conversion to TSV, and using awk next in the pipeline.
Some other memory-efficient CSV-aware tools worth considering include:
some components of csvkit (notably csvcut, csvclean, csvgrep)
Note that csvkit has builtin support for reading gzip, bz2 and xz (LZMA) compressed input files. (Similarly for DuckDB.)
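As a rough illustration of streaming a pre-filtered CSV straight into R, the csvgrep step below does the filtering before anything reaches fread(); this is only a sketch, assuming csvkit is on the PATH and using placeholder file and column names:

library(data.table)

# csvgrep keeps only rows whose book_author column matches "Lewis";
# fread() then reads the already-reduced stream.
dt <- fread(cmd = "csvgrep -c book_author -m Lewis big1.csv")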
To take advantage of the Python CSV module's ability to process lines in a streaming fashion, one can use csv.reader() along the following lines:
import csv

# "data.csv" is a placeholder file name
with open("data.csv", newline="") as csvfile:
    csvreader = csv.reader(csvfile)
    # Iterate over each row in the CSV file
    for row in csvreader:
        ...  # process one row at a time without loading the whole file
The above notes are obviously not comprehensive in any sense, but it's difficult to imagine that the task couldn't be accomplished using some combination of the tools that have been mentioned.
Upvotes: 1
Reputation: 76641
Here is a solution with package sqldf. It allows you to filter the data as it is read in with a SQL statement.
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
sql <- "
SELECT *, COUNT(book_author) cnt
FROM file
WHERE book_author = 'Lewis' AND individual = 'X'
GROUP BY year
HAVING COUNT(book_author) >= 3"
df_list <- lapply(filenames, read.csv.sql, sql = sql)
data.table::rbindlist(df_list, idcol = "id")
#> id book_author individual year cnt
#> <int> <char> <char> <int> <int>
#> 1: 2 Lewis X 2004 3
#> 2: 7 Lewis X 2011 3
#> 3: 8 Lewis X 2000 3
#> 4: 8 Lewis X 2018 3
#> 5: 9 Lewis X 2007 4
Created on 2024-12-03 with reprex v2.1.1
At the end of this test, clean up the temporary files by running
unlink(filenames)
The test files referenced above were created with the following code:
tmp_dir <- Sys.getenv("TEMP") |> chartr("\\", "/", x = _)
filenames <- sprintf("so_79246274_%d.csv", 1:9)
filenames <- file.path(tmp_dir, filenames)
n <- 1000
book_author <- c("Lewis", "Twain", "Dickens", "Eyre", "Scott")
individual <- c(LETTERS[1:9], "X")
year <- 2000:2024
set.seed(2024)
lapply(filenames, \(f) {
df1 <- data.frame(
book_author = sample(book_author, n, TRUE),
individual = sample(individual, n, TRUE),
year = sample(year, n, TRUE)
)
write.csv(df1, f, quote = FALSE, row.names = FALSE)
})
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
#>
#> [[5]]
#> NULL
#>
#> [[6]]
#> NULL
#>
#> [[7]]
#> NULL
#>
#> [[8]]
#> NULL
#>
#> [[9]]
#> NULL
Created on 2024-12-03 with reprex v2.1.1
Upvotes: 1
Reputation: 17011
A couple of strategies that can be used separately or combined: with readLines and fread, you can read the .csv files in chunks, and with the fst package you can write those chunks out as compressed .fst files. The idea is similar to the disk.frame package, which has been soft-deprecated, but the basic functionality is easy enough to do manually.
An example function that does both:
library(data.table)
library(fst)
csv_to_fst <- function(path, maxrow = 1e6, compress = 100) {
  fl <- file(path, "r")
  i <- 0L
  # first chunk: header row plus up to `maxrow` data rows
  dt <- fread(text = readLines(fl, maxrow + 1))
  cols <- colnames(dt)
  while (nrow(dt)) {
    i <- i + 1L
    # later chunks have no header, so reapply the original column names
    setnames(dt, cols)
    # write chunk i next to the source, e.g. data.csv -> data1.fst, data2.fst, ...
    write.fst(dt, gsub(".csv$", paste0(i, ".fst"), path), compress)
    # read the next chunk of up to `maxrow` lines
    dt <- fread(text = readLines(fl, maxrow))
  }
  close(fl)
  i  # number of .fst files written
}
With 100% compression, I've seen 10x+ size reduction, so the .fst files should take up considerably less space on disk. Reading .fst files is fast, and read.fst offers the additional ability to read in only specified columns and row ranges. [Side note: If you plan to do a lot of filtering on a particular column (like your "book_author" column), you could sort the data.tables by that column as you are creating the .fst files. Then, to filter, you could read in just the filtering column, find the span of rows needed, then read in just those rows for the rest of the columns that you need.]
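To make that side note concrete, here is a minimal sketch, assuming the chunks were written sorted by book_author; the file name data1.fst and the author value are placeholders:

library(data.table)
library(fst)

# Read only the (pre-sorted) filtering column of one chunk.
authors <- read.fst("data1.fst", columns = "book_author")$book_author

# Find the contiguous span of rows for the author of interest.
rows <- range(which(authors == "Lewis"))

# Read just those rows, all columns, from the same chunk.
if (all(is.finite(rows))) {
  chunk <- read.fst("data1.fst", from = rows[1], to = rows[2],
                    as.data.table = TRUE)
}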
Then, if we had the following .csv file, we could convert it into multiple .fst files. Here, I break it into 15 files of 10 rows each:
path <- "C:/temp/iris.csv"
data(iris)
fwrite(iris, path)
(i <- 1:csv_to_fst(path, 10))
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A simple function to operate on multiple .fst files and combine the results:
apply_fst <- function(paths, f) {
rbindlist(lapply(paths, \(x) f(read.fst(x, as.data.table = TRUE))))
}
Example usage:
f <- function(dt) {
dt[Sepal.Length < mean(Sepal.Length) & Sepal.Width > mean(Sepal.Width)]
}
apply_fst(sapply(i, \(i) gsub(".csv$", paste0(i, ".fst"), path)), f)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <char>
#> 1: 4.6 3.4 1.4 0.3 setosa
#> 2: 5.1 3.8 1.5 0.3 setosa
#> 3: 4.6 3.6 1.0 0.2 setosa
#> 4: 4.8 3.4 1.9 0.2 setosa
#> 5: 5.0 3.4 1.6 0.4 setosa
#> 6: 4.9 3.6 1.4 0.1 setosa
#> 7: 5.6 2.9 3.6 1.3 versicolor
#> 8: 5.6 3.0 4.5 1.5 versicolor
#> 9: 5.8 2.7 4.1 1.0 versicolor
#> 10: 5.9 3.2 4.8 1.8 versicolor
#> 11: 6.0 2.9 4.5 1.5 versicolor
#> 12: 5.4 3.0 4.5 1.5 versicolor
#> 13: 5.6 3.0 4.1 1.3 versicolor
#> 14: 6.3 3.3 6.0 2.5 virginica
#> 15: 6.5 3.0 5.8 2.2 virginica
#> 16: 6.5 3.2 5.1 2.0 virginica
#> 17: 6.4 3.2 5.3 2.3 virginica
#> 18: 6.5 3.0 5.5 1.8 virginica
#> 19: 6.1 3.0 4.9 1.8 virginica
#> 20: 6.3 3.4 5.6 2.4 virginica
#> 21: 6.4 3.1 5.5 1.8 virginica
#> 22: 6.2 3.4 5.4 2.3 virginica
You'll likely want to modify the above functions to meet your needs, but this should give you a good starting point.
Upvotes: 3
Reputation: 26695
If you have awk installed on your computer, one potential option is to filter your data 'out of memory' before loading it into R.
E.g. with two example csv files (test1.csv and test2.csv):
"test1.csv":
A,B,C,D
1,2,3,4
5,6,7,8
"test2.csv":
A,B,C,D
11,22,33,44
55,66,77,88
From within R:
library(data.table)
dt <- fread(cmd = "awk 'BEGIN{FS=\",\"} NR == 1 || $2 ~ \"2\"' ~/Desktop/test*.csv")
dt
#> A B C D
#> <int> <int> <int> <int>
#> 1: 1 2 3 4
#> 2: 11 22 33 44
Created on 2024-12-03 with reprex v2.1.0
The awk command is selecting rows where column 2 has a "2" in it (one row from each file). You can very likely handle the 'individual X booked 3 Lewis books in year Y' collapsing using awk as well, but I would need more details in order to include this 'collapsing' in the command (i.e. example input and output).
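Purely as an illustration of what such a collapsing step might look like, here is a sketch that assumes comma-separated files named big*.csv with columns individual, book_author, year in that order (an assumption, since the real layout isn't shown):

library(data.table)

# Count, per year, how many "Lewis" books individual "X" booked,
# pooling all matching files; the column order is assumed, not known.
dt <- fread(cmd = paste(
  "awk 'BEGIN{FS=\",\"; OFS=\",\"; print \"year,cnt\"}",
  "FNR > 1 && $1 == \"X\" && $2 ~ \"Lewis\" {cnt[$3]++}",
  "END{for (y in cnt) print y, cnt[y]}' big*.csv"
))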
NB. This approach assumes your filtering sufficiently reduces the size of the resulting dataset (i.e. it fits into RAM after selecting the rows of interest), and that you have awk installed.
Upvotes: 2