xyz

Reputation: 79

How can I load a large (3.96 GB) .tsv file in RStudio?

I want to load a 3.96 GB tab-separated values file into R, and my system has 8 GB of RAM. How can I load this file into R so that I can do some manipulation on it?

I tried library(data.table) to load my data, but I got this error message (Error: cannot allocate vector of size 965.7 Mb)

I also tried fread with the code below, but that did not work either: it ran for a long time and finally failed with an error.

as.data.frame(fread(file name))

Upvotes: 0

Views: 3117

Answers (3)

Oka

Reputation: 1328

If I were you, I probably would

1) try your fread code once more without the typo (closing parenthesis was initially missing):

as.data.frame(fread(file name))

2) try to read the file in parts by specifying the number of rows to read. Both read.csv and fread accept an nrows argument for this. Reading a small number of rows first lets you confirm that the file is actually readable before doing anything else. Files are sometimes malformed: special characters, wrong end-of-line characters, escaping problems or something else may need to be addressed first.
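A minimal sketch of that check, assuming data.table is installed. The temporary file and its column names are made up for illustration; in practice path would point at the real .tsv file.

```r
library(data.table)

# Write a tiny tab-separated file so the sketch is self-contained;
# the columns here are hypothetical stand-ins for the real data.
path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:100, value = rnorm(100), label = "a"),
       path, sep = "\t")

# Read only the first 10 data rows to check that the file parses
preview <- fread(path, sep = "\t", nrows = 10)

# Inspect column names and the types fread guessed
str(preview)
```

If the preview looks right, the guessed column types can also hint at which columns are worth keeping before attempting the full read.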

3) have a look at the bigmemory package, which has the read.big.matrix function. The ff package also offers the desired functionality.

Alternatively, I would probably also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file, for example with cut or awk, to remove unnecessary columns. Do I absolutely need to read it as one file and have all the data in memory simultaneously? If not, I could split the file or maybe use readLines to process it piece by piece.
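As a rough sketch of the readLines idea, using base R only: the file is read through a connection in fixed-size chunks, and only a running summary is kept in memory. The demo file, the column names and the chunk size are assumptions for illustration.

```r
# Self-contained demo file (hypothetical stand-in for the real .tsv)
path <- tempfile(fileext = ".tsv")
write.table(data.frame(id = 1:25, value = (1:25) * 2), path,
            sep = "\t", row.names = FALSE, quote = FALSE)

con <- file(path, open = "r")

# The first line holds the column names
header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]

chunk_size <- 10  # lines per chunk; an arbitrary illustrative value
total <- 0        # running sum of the `value` column
rows_seen <- 0    # number of data rows processed so far

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # end of file reached

  # Parse just this chunk, reusing the header for column names
  chunk <- read.table(text = lines, sep = "\t", col.names = header,
                      stringsAsFactors = FALSE)

  total <- total + sum(chunk$value)
  rows_seen <- rows_seen + nrow(chunk)
}
close(con)
```

Only one chunk is held in memory at a time, so this suits aggregations or filters that do not need the whole table at once. Real files may need extra read.table arguments such as quote or comment.char.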

PS. This topic is covered quite nicely in this post.

PPS. Thanks to @Yuriy Barvinchenko for the comment on fread.

Upvotes: 4

Yuriy Barvinchenko

Reputation: 1595

From my experience, and in addition to @Oka's answer:

  1. fread() has an nrows= argument, so you can read just the first 10 lines.
  2. If you find that you don't need all rows and/or all columns, you can put a row condition and a list of fields in square brackets directly after the call: fread()[]
  3. A data.table can be used like a data.frame in many cases, so you can try reading without as.data.frame()

This way I worked with a 5 GB CSV file.
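The points above might look like this in practice. This is only a sketch: the demo file and the column names (id, group, value, unused) are hypothetical. It also uses fread's select argument, which the list above does not mention, as one way to skip unneeded columns at read time.

```r
library(data.table)

# Small demo file with made-up columns, standing in for the real data
path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:100, group = rep(c("a", "b"), 50),
                  value = 1:100, unused = "x"),
       path, sep = "\t")

# Read only the needed columns; the result stays a data.table,
# so no as.data.frame() copy is made
dt <- fread(path, sep = "\t", select = c("id", "group", "value"))

# Row condition and field list in brackets directly after fread()
subset_dt <- fread(path, sep = "\t")[group == "a" & value > 50, .(id, value)]
```

Note that the bracket form still parses the whole file before filtering, so it mainly reduces what is kept afterwards, whereas select avoids loading the dropped columns in the first place.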

Upvotes: 1

G5W

Reputation: 37641

You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with

fread(file name, data.table=FALSE)

Also, it wouldn't hurt to run garbage collection.

gc()

Upvotes: 1
