Reputation: 831
I have a very large .csv file (~4GB) which I'd like to read, then subset.
The problem comes at reading (memory allocation error). Being that large reading crashes, so what I'd like is a way to subset the file before or while reading it, so that it only gets the rows for one city (Cambridge).
f:
id City Value
1 London 17
2 Coventry 21
3 Cambridge 14
......
I've already tried the usual approaches:
f <- read.csv(f, stringsAsFactors=FALSE, header=T, nrows=100)
f.colclass <- sapply(f,class)
f <- read.csv(f,sep = ",",nrows = 3000000, stringsAsFactors=FALSE,
header=T,colClasses=f.colclass)
which seem to work for up to 1-2M rows, but not for the whole file.
I've also tried subsetting at the reading itself using pipe:
f<- read.table(file = f,sep = ",",colClasses=f.colclass,stringsAsFactors = F,pipe('grep "Cambridge" f ') )
and this also seems to crash.
I thought packages sqldf or data.table would have something, but no success yet !!
Thanks in advance, p.
Upvotes: 0
Views: 5930
Reputation: 10223
Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (Here's an example from one of the multitude of similar questions/answers).
The only difference is, unlike for most of the answers that load the whole file, you would read it chunk by chunk and only keep the subset you need at each iteration
# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")
# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go
i = 0 # set i to 0 as it'll increase as we loop through the chunks
# loop through the chunks and select rows that contain cambridge
repeat {
# things to do only on the first read-through
if(i==0){
# read in columns only on the first go
grab_header=TRUE
# load the chunk
tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header)
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# initiate container for desired data
df = tmp_chunk[cond,] # save desired subset in initial container
cols = colnames(df) # save column names to re-use on next chunks
}
# things to do on all subsequent non-first chunks
else if(i>0){
grab_header=FALSE
tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header,col.names = cols)
# set stopping criteria for the loop
# when it reads in 0 rows, exit loop
if(nrow(tmp_chunk)==0){break}
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# append to existing dataframe
df = rbind(df, tmp_chunk[cond,])
}
# add 1 to i to avoid the things needed to do on the first read-in
i=i+1
}
close(con) # close connection
# check out the results
head(df)
Upvotes: 1
Reputation: 61
I think this was alluded to already but just in case it wasn't completely clear. The sqldf package creates a temporary SQLite DB on your machine based on the csv file and allows you to write SQL queries to perform subsets of the data before saving the results to a data.frame
library(sqldf)
query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)
#or rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
sql = "select sum(Value) from file where City=='Cambridge' " )
Upvotes: 3
Reputation: 388
One solution to this type of error is
I am working with a datasets of size more than 20 GB this way.
Upvotes: 1