Reputation: 735
I have a 53 GB file, and here's its head:
1 10 2873
1 100 22246
1 1000 28474
1 10000 35663
1 10001 35755
1 10002 35944
1 10003 36387
1 10004 36453
1 10005 36758
1 10006 37240
I'm running R 3.3.2 on a 64-bit CentOS 7 server with 128 GB of RAM. I've already read 4098 similar files into R, but I can't read this largest one.
df <- read.table(f, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items
It returns an error saying "too many items". I then followed this tip:
df5rows <- read.table(f, nrows=5, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
classes <- sapply(df5rows, class)
df <- read.table(f, nrows=3231959401, colClasses=classes, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
It still says "too many items" and warns that NAs are introduced. I also tried without colClasses, with the same result:
df <- read.table(f, nrows=3231959401, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
NAs introduced by coercion to integer range
Memory usage never went above 90 GB without any nrows or colClasses, and never above 60 GB with those arguments. I don't understand why R can't read the file.
I've also checked that there's no line with 4 or more columns.
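One way to run that check in R, for example, is count.fields (slow on a file this size, since it has to scan the whole file):
# tabulate the number of fields per line; every line should report 3
table(count.fields(f, sep='\t', quote='', comment.char=''))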
Upvotes: 3
Views: 681
Reputation: 189
Did you try cutting the file with a lightweight tool such as sed or vi? Then you would just have to merge the two datasets. On a very similar machine with a big file, I ran into the same problem: it turned out to be a junk line. With files of that size, this kind of error happens.
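A rough sketch of what I mean (the split point and the file names here are just placeholders, adjust them to your file):
# shell step, outside R: split the file into two halves by line count
#   split -l 1616000000 bigfile.tsv part_
part1 <- read.table('part_aa', header=FALSE, col.names=c('a', 'b', 'dist'),
                    sep='\t', quote='', comment.char='')
part2 <- read.table('part_ab', header=FALSE, col.names=c('a', 'b', 'dist'),
                    sep='\t', quote='', comment.char='')
# merge the two pieces back into one data frame
df <- rbind(part1, part2)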
Upvotes: 1