Reputation: 83
So I have a large csv excel file that my computer cannot handle opening without rstudio terminating.
To solve this I am trying to iterate through the rows of the file in order do my calculations on each row at a time, before storing the value and then moving on to the next row.
This I can normally achieve (eg on a smaller file) through simply reading and storing the whole csv file within Rstudio and running a simple for loop.
It is, however, the size of this storage of data that I am trying to avoid, hence I am trying to read a row of the csv file one at a time instead.
(I think that makes sense)
This was suggested :here
I have managed to get my calculations to be read and work quickly for the first row of my data file.
It is the looping over this that I am struggling with, as I am trying to use a for loop (potentially should be using a while/if statement) but I have nowhere for the "i" value to be called from within the loop: part of my code is below:
con = file(FileName, "r")
for (row in 1:nrow(con)) {
data <- read.csv(con, nrow=1) #reading of file
"insert calculations here"
}
So the "row"
is not called upon so the loop only goes through once. I also have an issue with the "1:nrow(con)"
as clearly the nrow(con)
simply returns NULL
Any help with this would be great, thanks.
Upvotes: 0
Views: 3370
Reputation: 44907
read.csv()
will generate an error if it tries to read past the end of the file. So you could do something like this:
con <- file(FileName, "rt")
repeat {
data <- try(read.csv(con, nrow = 1, header = FALSE), silent = TRUE) #reading of file
if (inherits(data, "try-error")) break
"insert calculations here"
}
close(con)
It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. And I'd recommend specifying the column types using colClasses
in the read.csv()
call, so that R doesn't guess differently sometimes.
Edited to add:
We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:
n <- 1 # desired batch size
col.names <- paste0("C", 1:3000) # desired column names
con <- file(FileName, "rt")
readLines(con, 1) # Skip over bad header row
repeat {
data <- try(read.csv(con, nrow = n, header = FALSE,
col.names = col.names,
colClasses = "integer"),
silent = TRUE) #reading of file
if (inherits(data, "try-error")) break
"insert calculations here"
}
close(con)
Upvotes: 1
Reputation: 6776
You could read in your data in batches of, say, 10,000 rows at a time (but you can change n to do as much as you want), do your calculations and then write the changes to a new file, appending the each batch to the end of the file.
Something like:
i = 0
n = 10000
while (TRUE) {
df = readr::read_csv('my_file.csv', skip=i, n_max=n)
# If the number of rows in the file is divisible by n, it may be the case
# that the next pass will result in an empty data.frame being returned
if (nrow(df) > 0) {
# do your calculations
# If you have performed calculations on df and want to save those results,
# save the data.frame to a file, appending it to the file to avoid overwriting prior results.
readr::write_csv(df, 'my_new_file.csv', append=TRUE)
} else {
break
}
# Check to see if we need to keep going, if so add n to i
if (nrow(df) < n) {
break
} else {
i = i + n
}
}
Upvotes: 1