Reading a CSV file, looping through the rows, using connections

Question

So I have a large csv excel file that my computer cannot handle opening without rstudio terminating.

To solve this I am trying to iterate through the rows of the file in order do my calculations on each row at a time, before storing the value and then moving on to the next row.

This I can normally achieve (eg on a smaller file) through simply reading and storing the whole csv file within Rstudio and running a simple for loop.

It is, however, the size of this storage of data that I am trying to avoid, hence I am trying to read a row of the csv file one at a time instead.

(I think that makes sense)

This was suggested :here

I have managed to get my calculations to be read and work quickly for the first row of my data file.

It is the looping over this that I am struggling with, as I am trying to use a for loop (potentially should be using a while/if statement) but I have nowhere for the "i" value to be called from within the loop: part of my code is below:

con = file(FileName, "r")
  for (row in 1:nrow(con)) {
    data <- read.csv(con, nrow=1) #reading of file
 "insert calculations here"
}

So the "row" is not called upon so the loop only goes through once. I also have an issue with the "1:nrow(con)" as clearly the nrow(con) simply returns NULL

Any help with this would be great, thanks.

user2554330 · Accepted Answer

read.csv() will generate an error if it tries to read past the end of the file. So you could do something like this:

con <- file(FileName, "rt")
repeat {
   data <- try(read.csv(con, nrow = 1, header = FALSE), silent = TRUE) #reading of file
   if (inherits(data, "try-error")) break
   "insert calculations here"
}
close(con)

It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. And I'd recommend specifying the column types using colClasses in the read.csv() call, so that R doesn't guess differently sometimes.

Edited to add:

We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:

n <- 1                           # desired batch size
col.names <- paste0("C", 1:3000) # desired column names
con <- file(FileName, "rt")
readLines(con, 1)                # Skip over bad header row
repeat {

   data <- try(read.csv(con, nrow = n, header = FALSE,
                        col.names = col.names,
                        colClasses = "integer"), 
               silent = TRUE) #reading of file
   if (inherits(data, "try-error")) break
   "insert calculations here"
}
close(con)

Reading a CSV file, looping through the rows, using connections

Answers (2)

Related Questions