JonathanM

Reputation: 37

Filter CSV files for specific value before importing

I have a folder with thousands of comma delimited CSV files, totaling dozens of GB. Each file contains many records, which I'd like to separate and process separately based on the value in the first field (for example, aa, bb, cc, etc.).

Currently, I'm importing all the files into a dataframe and then subsetting in R into smaller, individual dataframes. The problem is that this is very memory intensive - I'd like to filter the first column during the import process, not once all the data is in memory.

This is my current code:

library(data.table)

setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, fread, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
DF <- rbindlist(temp)
DFaa <- subset(DF, V1 =="aa")

If possible, I'd like to move the "subset" process into lapply.

Thanks

Upvotes: 1

Views: 834

Answers (4)

G. Grothendieck

Reputation: 269461

1) read.csv.sql This will read a file directly into a temporarily set up SQLite database (which it does for you) and then only read the aa records into R. The rest of the file will not be read into R at any time. The table will then be deleted from the database.

File is a character string that contains the file name (or pathname if not in the current directory). Other arguments may be needed depending on the format of the data.

library(sqldf)

read.csv.sql(File, "select * from file where V1 == 'aa'", dbname = tempfile())
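Since the question involves thousands of files, the single-file call can be wrapped in lapply and the filtered pieces stacked with rbindlist. A minimal sketch, assuming the E:/Data/ folder and the aa filter from the question (and headerless files, as in the original code):

```r
library(sqldf)
library(data.table)

files <- list.files("E:/Data/", pattern = "\\.csv$", full.names = TRUE)

# For each file, load only the V1 == 'aa' rows via a temporary SQLite
# database, then stack the filtered pieces into one data.table.
DFaa <- rbindlist(lapply(files, function(f)
  read.csv.sql(f, "select * from file where V1 = 'aa'",
               dbname = tempfile(), header = FALSE)))
```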

2) grep/findstr Another possibility is to use grep (Linux) or findstr (Windows) to extract the lines containing aa. That should get you the desired lines plus possibly a few others, and at that point the input is much smaller, so it can be subset in R without memory problems. For example,

fread("findstr aa File")[V1 == 'aa'] # Windows
fread("grep aa File")[V1 == 'aa']    # Linux

sed or gawk could also be used and are included with Linux and in Rtools on Windows.

3) external csv utilties These external utilities understand csv and are free and available on all platforms that R supports: csvfix, csvkit, csvtk, miller and xsv (or the xsv fork qsv).

Below are examples using fread with csvfix and xsv.

# csvfix
fread("csvfix find -if $1==aa File")  # Windows
fread("csvfix find -if '$1'==aa File")  # Linux bash

# xsv (qsv has the same syntax)
fread("xsv search aa --select 1 File.csv")[V1 == "aa"]

Upvotes: 3

Carl Witthoft

Reputation: 21492

If you don't want to muck with SQL, consider using the skip and nrows arguments in a loop. Slower, but that way you read in a block of lines, filter it, then read the next block into the same temp variable (so as not to take extra memory), and so on.
Inside your lapply call, use either a second lapply or, equivalently, a loop:

mydata <- list()
for (jj in 0:N) {
    # fread's skip takes a single row offset, not a range; pair it with
    # nrows to read one 1000-row block per iteration
    foo <- fread(filename, skip = jj*1000, nrows = 1000, sep=",",
                 fill=TRUE, integer64="numeric", header=FALSE)
    mydata[[jj + 1]] <- do_something_to_filter(foo)
}
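Putting this together for one file, the chunked read can be wrapped in a helper that stops once fread runs past the end of the file. A sketch; read_filtered and the V1 == "aa" filter are illustrative names from the question, and the chunk size is arbitrary:

```r
library(data.table)

# Hypothetical helper: chunked, filtered read of a single file.
# At most one unfiltered chunk is in memory at a time.
read_filtered <- function(filename, chunk = 1000) {
  pieces <- list()
  jj <- 0
  repeat {
    # fread errors when skip is past the end of the file; treat that
    # as end-of-input
    foo <- tryCatch(
      fread(filename, skip = jj * chunk, nrows = chunk, sep = ",",
            fill = TRUE, integer64 = "numeric", header = FALSE),
      error = function(e) NULL)
    if (is.null(foo) || nrow(foo) == 0) break
    pieces[[jj + 1]] <- foo[V1 == "aa"]
    jj <- jj + 1
  }
  rbindlist(pieces)
}
```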

Upvotes: 0

iod

Reputation: 7592

setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, function(x) subset(fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE), V1=="aa"))
DF <- rbindlist(temp)

Untested, but this will probably work - replace your function call with an anonymous function.

Upvotes: 1

Duck

Reputation: 39595

This could help, though you may need to extend the function:

#Function
myload <- function(x) 
{
  y <- fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
  y <- subset(y, V1 =="aa")
  return(y)
}
#Apply
temp <- lapply(files, myload)
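If the goal is one table per key value (aa, bb, cc, …) rather than just the aa subset, the same filtered import can feed data.table's split method. A sketch, assuming the key values are known up front; keys and by_key are illustrative names:

```r
library(data.table)

files <- list.files("E:/Data/", pattern = "\\.csv$", full.names = TRUE)
keys  <- c("aa", "bb", "cc")

# Keep only rows whose first field is one of the keys while importing,
# then split the combined table into one data.table per key.
DF <- rbindlist(lapply(files, function(f)
  fread(f, sep = ",", fill = TRUE, integer64 = "numeric",
        header = FALSE)[V1 %chin% keys]))
by_key <- split(DF, by = "V1")   # named list: by_key$aa, by_key$bb, ...
```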

Upvotes: 0
