LynnLatitude

Reputation: 33

How to create a data frame from super large vectors?

I have 7 very large vectors, c1 to c7. My task is simply to create a data frame from them. However, when I use data.frame(), an error is returned:

> newdaily <- data.frame(c1,c2,c3,c4,c5,c6,c7)
Error in if (mirn && nrows[i] > 0L) { : 
  missing value where TRUE/FALSE needed
Calls: data.frame
In addition: Warning message:
In attributes(.Data) <- c(attributes(.Data), attrib) :
  NAs introduced by coercion to integer range
Execution halted

They all have the same length (2,626,067,374 elements), and I’ve checked there’s no NA.

I tried subsetting 1/5 of each vector, and data.frame() works fine on the subsets. So I guess it has something to do with the length/size of the data? Any ideas how to fix this problem? Many thanks!!


Update: both data.frame and data.table only allow vectors with fewer than 2^31 - 1 elements. Still can't find a solution to create one super large data frame, so I subset my data instead... hope longer vectors will be allowed in the future.
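A minimal sketch of that chunked workaround (the helper name make_chunked_frames and the toy-sized vectors are made up for illustration; for the real data, chunk_size would be any value safely below 2^31 - 1):

```r
# Split an index range into chunks that each fit a data.frame,
# then build one data.frame per chunk.
make_chunked_frames <- function(vecs, chunk_size) {
  n <- length(vecs[[1]])
  starts <- seq(1, n, by = chunk_size)
  lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, n)
    as.data.frame(lapply(vecs, `[`, idx))
  })
}

# Toy example standing in for the 2.6-billion-element vectors:
c1 <- runif(10); c2 <- runif(10)
chunks <- make_chunked_frames(list(c1 = c1, c2 = c2), chunk_size = 4)
length(chunks)  # 3 data.frames, of 4, 4 and 2 rows
```

Each element of the result is an ordinary data.frame whose row count stays within the integer limit.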

Upvotes: 1

Views: 1148

Answers (2)

Pedro Henrique S

Reputation: 115

For that kind of data size you need to optimize your memory usage, but how?

You need to write these values in a file.

   output_name <- "output.csv"
   # sep (not collapse) pastes element-wise, giving one line per row
   lines <- paste(c1, c2, c3, c4, c5, c6, c7, sep = ";")
   cat(lines, file = output_name, sep = "\n")

But you'll probably need to analyse them too, and (as said before) that requires a lot of memory.

So you have to read the file in batches of lines (say, 20k lines) per iteration to limit your RAM usage, analyse those values, save the results, and repeat:

    con <- file(output_name, "r")

    while (your_conditional) {
        lines_in_this_round <- readLines(con, n = 20000)
        # create data.frame
        # analyse data
        # save result
        # update your_conditional
    }
    close(con)
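
A concrete, self-contained version of that loop (a sketch; the tiny demo file and a batch size of 2 stand in for the real output.csv and 20k-line batches), stopping when readLines returns no more lines:

```r
# Demo file standing in for the big output.csv
writeLines(c("1;2", "3;4", "5;6"), "output.csv")

con <- file("output.csv", "r")
total_rows <- 0
repeat {
  batch <- readLines(con, n = 2)   # small batch for the demo
  if (length(batch) == 0) break    # end of file reached
  # Split each ";"-separated line back into columns
  df <- read.table(text = batch, sep = ";")
  total_rows <- total_rows + nrow(df)
  # ... analyse df and save results here ...
}
close(con)
total_rows  # 3
```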

I hope this helps you.

Upvotes: 0

Roland

Reputation: 132969

R's data.frames don't support such long vectors yet.

Your vectors are longer than 2^31 - 1 = 2147483647, which is the largest value R's integer type can represent. Since the data.frame function/class assumes that the number of rows can be represented by an integer, you get an error:

x <- rep(1, 2626067374)
DF <- data.frame(x)
#Error in if (mirn && nrows[i] > 0L) { : 
#  missing value where TRUE/FALSE needed
#In addition: Warning message:
#In attributes(.Data) <- c(attributes(.Data), attrib) :
#  NAs introduced by coercion to integer range

Basically, something like this happens internally:

as.integer(length(x))
#[1] NA
#Warning message:
#  NAs introduced by coercion to integer range 

As a result the if condition becomes NA and you get the error.
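
You can check this limit directly against the length from the question:

```r
x_len <- 2626067374           # length of each vector in the question
.Machine$integer.max          # 2147483647, i.e. 2^31 - 1
x_len > .Machine$integer.max  # TRUE, so as.integer(x_len) is NA
```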

Possibly, you could use the data.table package instead. Unfortunately, I don't have sufficient RAM to test:

library(data.table)
DT <- data.table(x = rep(1, 2626067374))
#Error: cannot allocate vector of size 19.6 Gb

Upvotes: 3
