Reputation: 79
I want to load a 3.96 GB tab-separated values file into R, and my system has 8 GB of RAM. How can I load this file into R so that I can do some manipulation on it?
I tried library(data.table) to load my data, but I got this error message: Error: cannot allocate vector of size 965.7 Mb
I also tried fread with the code below, but it did not work either: it ran for a long time and eventually showed an error.
as.data.frame(fread(file name))
Upvotes: 0
Views: 3117
Reputation: 1328
If I were you, I would probably:
1) try your fread code once more without the typo (the closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying the number of rows to read. This can be done in read.csv and fread with the nrows argument. Reading a small number of rows first lets you confirm that the file is actually readable before doing anything else. Sometimes files are malformed: there could be special characters, wrong end-of-line characters, escaping problems, or something else that needs to be addressed first.
3) have a look at the bigmemory package, which has the read.big.matrix function. The ff package also offers the desired functionality.
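A minimal sketch of point 2, assuming data.table is installed. The sample file, the chunk size, and the variable names are made up for illustration; substitute your own path. It peeks at the first rows with nrows, then walks through the file in chunks using skip together with nrows.

```r
library(data.table)

# Hypothetical sample file so the sketch is self-contained.
path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:1000, value = rnorm(1000)), path, sep = "\t")

# Peek at the first rows to confirm the file parses as expected.
head_dt <- fread(path, sep = "\t", nrows = 10)
str(head_dt)

# Read the file in fixed-size chunks, reusing the column names from the peek.
chunk_size <- 300L        # illustrative; choose what fits in memory
col_names  <- names(head_dt)
skip       <- 1L          # skip the header line
total      <- 0L

repeat {
  chunk <- tryCatch(
    fread(path, sep = "\t", skip = skip, nrows = chunk_size,
          header = FALSE, col.names = col_names),
    error = function(e) NULL  # reading past the end of the file
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break

  # Process the chunk here (filter, aggregate, write out, ...).
  total <- total + nrow(chunk)
  skip  <- skip + nrow(chunk)

  if (nrow(chunk) < chunk_size) break  # last, shorter chunk
}

total
```

Only one chunk is held in memory at a time, so this pattern works best when each chunk can be reduced (filtered or aggregated) before the next one is read.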
Alternatively, I would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file, for example with cut or awk, to remove unnecessary columns. Do I absolutely need to read it as one file and have all the data in memory simultaneously? If not, I could split the file or perhaps use readLines.
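As a rough illustration of the readLines idea, the sketch below streams a file through an open connection a few lines at a time and keeps only a running total for one column. The file contents, the column name value, and the chunk size of 4 lines are assumptions made for the example.

```r
# Hypothetical sample file so the sketch is self-contained.
path <- tempfile(fileext = ".tsv")
writeLines(
  c("id\tname\tvalue",
    paste(1:10, letters[1:10], (1:10) * 1.5, sep = "\t")),
  path
)

con <- file(path, open = "r")

# The first line is the header; find the position of the column we need.
header    <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
value_idx <- match("value", header)

running_sum <- 0
n_rows      <- 0L

repeat {
  lines <- readLines(con, n = 4)   # next chunk of lines from the connection
  if (length(lines) == 0L) break   # end of file

  # Split each line on tabs and keep only the needed field.
  fields <- strsplit(lines, "\t", fixed = TRUE)
  values <- as.numeric(vapply(fields, `[`, character(1), value_idx))

  running_sum <- running_sum + sum(values)
  n_rows      <- n_rows + length(lines)
}

close(con)

running_sum / n_rows  # mean of the column, computed without loading the whole file
```

Because the connection stays open between calls, each readLines call continues where the previous one stopped, so memory use is bounded by the chunk size rather than the file size.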
ps. This topic is covered quite nicely in this post.
pps. Thanks to @Yuriy Barvinchenko for the comment on fread.
Upvotes: 4
Reputation: 1595
From my experience, and in addition to @Oka's answer:
fread() has an nrows= argument, so you can read just the first 10 lines. fread()[]
This is how I worked with a 5 GB CSV file.
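A small sketch of this workflow, assuming data.table is installed: read the first 10 lines to see what the file contains, then load only what is needed. The select= argument used in the second step is my addition rather than something stated in the answer, and the sample file and column names are hypothetical.

```r
library(data.table)

# Hypothetical sample file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
fwrite(
  data.table(id = 1:100, name = rep(letters[1:10], 10), value = runif(100)),
  path
)

# Step 1: read only the first 10 lines to inspect column names and types.
preview <- fread(path, nrows = 10)
print(names(preview))

# Step 2 (assumption: only some columns are needed): load just those columns.
slim <- fread(path, select = c("id", "value"))
dim(slim)
```

Dropping unneeded columns at read time reduces memory use roughly in proportion to the columns skipped.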
Upvotes: 1
Reputation: 37641
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
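For illustration, here is a minimal runnable version of the two calls above, assuming data.table is installed. The temporary file stands in for the real one.

```r
library(data.table)

# Hypothetical sample file so the sketch is self-contained.
path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:5, value = c(0.1, 0.2, 0.3, 0.4, 0.5)), path, sep = "\t")

# Read straight into a data.frame, avoiding the extra copy that
# as.data.frame(fread(...)) would create.
df <- fread(path, sep = "\t", data.table = FALSE)
class(df)

# Ask R to release memory that is no longer referenced.
invisible(gc())
```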
Upvotes: 1