user3310782

Reputation: 831

Subset large .csv file at reading in R

I have a very large .csv file (~4GB) which I'd like to read, then subset.

The problem comes when reading it (memory allocation error). Because the file is so large, reading it crashes, so what I'd like is a way to subset the file before or while reading it, so that it only keeps the rows for one city (Cambridge).

f:
        id City      Value
        1  London     17
        2  Coventry   21
        3  Cambridge  14
        ......

I've already tried the usual approaches:

f <- read.csv(f, stringsAsFactors = FALSE, header = TRUE, nrows = 100)

f.colclass <- sapply(f, class)

f <- read.csv(f, sep = ",", nrows = 3000000, stringsAsFactors = FALSE,
              header = TRUE, colClasses = f.colclass)

which seem to work for up to 1-2M rows, but not for the whole file.

I've also tried subsetting at the reading itself using pipe:

f <- read.table(file = f, sep = ",", colClasses = f.colclass,
                stringsAsFactors = FALSE, pipe('grep "Cambridge" f '))

and this also seems to crash.
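For context, pipe() is normally passed as the file argument itself rather than alongside file =. A minimal sketch of that form (the path "f.csv" and the availability of a system grep are assumptions, not from the original post):

# sketch only: the grep output has no header row, so grab the column names separately
hdr <- read.csv("f.csv", nrows = 1, header = TRUE)
f <- read.table(pipe('grep "Cambridge" f.csv'), sep = ",",
                stringsAsFactors = FALSE, col.names = names(hdr))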

I thought the sqldf or data.table packages would have something for this, but no success yet!

Thanks in advance, p.

Upvotes: 0

Views: 5930

Answers (3)

Gene Burinsky

Reputation: 10223

Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: reading the file in chunks (a pattern that shows up in many similar questions and answers).

The only difference is that, unlike answers that load the whole file, you read it chunk by chunk and keep only the subset you need at each iteration.

# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")

# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go    

i = 0 # set i to 0 as it'll increase as we loop through the chunks

# loop through the chunks and select rows that contain cambridge
repeat {
  # things to do only on the first read-through 
  if(i==0){
    # read in columns only on the first go
    grab_header=TRUE
    # load the chunk
    tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header)
    
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"
    
    # initiate container for desired data
    df = tmp_chunk[cond,] # save desired subset in initial container
    cols = colnames(df) # save column names to re-use on next chunks
  } else if(i>0){
    # things to do on all subsequent non-first chunks
    grab_header=FALSE
    # read.csv errors out (rather than returning 0 rows) once the connection
    # is exhausted, so catch that and treat it as an empty chunk
    tmp_chunk = tryCatch(
      read.csv(con, nrows = chunk_size, header=grab_header, col.names = cols),
      error = function(e) data.frame()
    )

    # set stopping criteria for the loop
    # when it reads in 0 rows (or nothing is left to read), exit loop
    if(nrow(tmp_chunk)==0){break}
    
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"        
    
    # append to existing dataframe
    df = rbind(df, tmp_chunk[cond,])
  }
  # add 1 to i to avoid the things needed to do on the first read-in
  i=i+1
}

close(con) # close connection

# check out the results
head(df)

Upvotes: 1

Joe R

Reputation: 61

I think this was alluded to already, but just in case it wasn't completely clear: the sqldf package creates a temporary SQLite database on your machine from the csv file and lets you write SQL queries that subset the data before the results are saved to a data.frame.

library(sqldf)

query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)

# or, rather than saving all of the raw data in f, you may want to compute a sum
f_sum <- read.csv.sql(file = "f.csv",
                      sql = "select sum(Value) from file where City=='Cambridge' ")

Upvotes: 3

Pankaj Sharma

Reputation: 388

One solution to this type of error is:

  1. Convert your csv file to an Excel file first.
  2. Then map the Excel file into a MySQL table using Toad for MySQL; it is easy. Just check the data types of the variables.
  3. Then use the RODBC package to access such a large dataset (see the sketch below).

I am working with datasets larger than 20 GB this way.
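A minimal sketch of the RODBC step, assuming an ODBC DSN has already been set up for the MySQL database (the DSN name, credentials, and table name below are placeholders, not from the answer):

library(RODBC)

# "mysql_dsn", the credentials and the table name are placeholders
con <- odbcConnect("mysql_dsn", uid = "user", pwd = "password")

# let the database do the filtering so only the Cambridge rows come back into R
f <- sqlQuery(con, "SELECT id, City, Value FROM city_values WHERE City = 'Cambridge'")

odbcClose(con)
head(f)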

Upvotes: 1
