user1805343

Reputation: 127

Working with a list of data frames as a list

I have two lists, A and B. Each contains 100 data frames, and each data frame has dimensions 25000 x 25000. I would like a single correlation value for each pair of data frames: take the first data frame from each list and compute cor(A, B) to get one value for the whole data frame, then do the same for the second data frame in each list, and so on for all 100 data frames.

I tried the following:

  A # list of 100 dataframes

  $1 ### dataframe 1
  $2
  $3
  ....
  $100   ### dataframe 100

  B # list of 100 dataframes

  $1 ### dataframe 1
  $2
  $3
  ....
  $100   ### dataframe 100

  C<- A[1] # extract only the first list from A
  D<- B[1] # extract only the first list from B

  C<-unlist(C) ### unlist C
  D<-unlist(D) ## unlist D

Then computed

   Correlation<- cor(C,D) ## to obtain a single correlation coefficient to see how these two vectors are correlated         

But I end up with an error saying:

  R cannot allocate a vector of size 3.9 GB

Is there a better, faster way to do this that could be applied to the entire list? I work on a server that allows large computations, but I still get this error, and the unlisting takes ages because of the size of the data frames.

Upvotes: 0

Views: 334

Answers (3)

mrip

Reputation: 15163

A few issues here. First of all, a dataframe may not be a good representation for matrices of size 25000x25000. Data frames typically have a small number of columns and a large number of rows. If every column is the same data type (which seems to be the case), then, depending on what else you need to do with the data, you might consider just working with matrices.

Next, the reason unlist takes a long time is that unlist appears to be implemented naively, essentially using repeated calls to c() (you can check the source to find out for sure). Try this instead:

C<-as.vector(as.matrix(C))

That should coerce C to a matrix more efficiently and then simply drop the dimension attribute, which leaves the vector you are looking for.
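As a small check of that idea, the sketch below builds a tiny all-numeric data frame (the name `df` and its contents are made up for illustration) and confirms that flattening through `as.matrix` gives the same values, in the same column-by-column order, as `unlist`. It assumes every column is numeric, as in the question.

```r
# Toy data frame standing in for one of the large ones (illustrative only).
set.seed(1)
df <- data.frame(a = rnorm(4), b = rnorm(4), c = rnorm(4))

# Flatten via a matrix: as.matrix() builds one numeric block,
# as.vector() then drops the dim and dimnames attributes.
v1 <- as.vector(as.matrix(df))

# Flatten via unlist(); use.names = FALSE skips building element names.
v2 <- unlist(df, use.names = FALSE)

same_values <- identical(v1, v2)   # both are column-major numeric vectors
n_values <- length(v1)             # rows * columns
```

Both routes visit the columns in order, so either vector can be passed to `cor`.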

Next, you are dealing with a fairly large dataset, and the error you are getting means you are pushing the limits of the RAM you have available. Did you get the memory error during the call to unlist or during the call to cor? It would be helpful to provide the exact output of the R terminal.

I would suggest trying to do the computation using as.vector(as.matrix(C)) rather than unlist(C) and see if that works. If not, try garbage collecting (i.e. calling gc()) in between some of the calls.
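If flattening still exhausts memory, one alternative (not part of the suggestion above, and only a sketch) is to avoid building the long vectors at all and accumulate the sums that Pearson's formula needs one column at a time. The function name `cor_by_column` is hypothetical, and the code assumes both data frames are all-numeric, have the same shape, and contain no missing values. The one-pass sum formula can be less numerically stable than `cor` for data with a very large mean relative to its spread.

```r
# Pearson correlation of two equally shaped, all-numeric data frames,
# computed from running sums so that only one column of each is
# processed at a time. Hypothetical helper, not from the original answer.
cor_by_column <- function(X, Y) {
  stopifnot(length(X) == length(Y))
  n <- 0; sx <- 0; sy <- 0; sxx <- 0; syy <- 0; sxy <- 0
  for (j in seq_along(X)) {
    x <- as.numeric(X[[j]])
    y <- as.numeric(Y[[j]])
    stopifnot(length(x) == length(y))
    n   <- n + length(x)
    sx  <- sx + sum(x)
    sy  <- sy + sum(y)
    sxx <- sxx + sum(x * x)
    syy <- syy + sum(y * y)
    sxy <- sxy + sum(x * y)
  }
  (n * sxy - sx * sy) / sqrt((n * sxx - sx^2) * (n * syy - sy^2))
}

# Small made-up inputs to compare against the flatten-then-cor approach.
set.seed(2)
X_small <- data.frame(matrix(rnorm(30), nrow = 6))
Y_small <- data.frame(matrix(rnorm(30), nrow = 6))
r_stream <- cor_by_column(X_small, Y_small)
r_flat   <- cor(as.vector(as.matrix(X_small)), as.vector(as.matrix(Y_small)))
```

On the toy inputs the two results should agree up to floating-point rounding.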

As for applying the operation to the whole list, you could simply use mapply. However, given that you are having memory issues, it might be a good idea to keep more control of exactly what is going on by writing less elegant imperative code. Something like this is simple enough:

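corvec <- numeric(length(A))             # one correlation per pair of data frames
for (i in seq_along(A)) {
  gc()                                   # release memory left over from the previous iteration
  C <- as.vector(as.matrix(A[[i]]))      # flatten the i-th data frame of A
  D <- as.vector(as.matrix(B[[i]]))      # flatten the i-th data frame of B
  corvec[i] <- cor(C, D)                 # a single coefficient for this pair
}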
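For comparison, the mapply version mentioned above might look like the following sketch. The lists `A_small` and `B_small` are tiny made-up stand-ins for A and B, and `flat_cor` is a hypothetical helper name; the trade-off is that mapply gives less control over when intermediate objects are freed.

```r
# Tiny stand-ins for the two lists of data frames (illustrative only).
set.seed(42)
A_small <- replicate(3, data.frame(matrix(rnorm(20), nrow = 5)), simplify = FALSE)
B_small <- replicate(3, data.frame(matrix(rnorm(20), nrow = 5)), simplify = FALSE)

# Flatten both data frames and return one correlation coefficient.
flat_cor <- function(a, b) {
  cor(as.vector(as.matrix(a)), as.vector(as.matrix(b)))
}

# mapply walks the two lists in parallel and simplifies to a numeric vector.
corvec_small <- mapply(flat_cor, A_small, B_small)

# The same result with an explicit loop, for reference.
corvec_loop <- numeric(length(A_small))
for (i in seq_along(A_small)) {
  corvec_loop[i] <- flat_cor(A_small[[i]], B_small[[i]])
}
```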

Upvotes: 1

sachinruk

Reputation: 9869

Not a direct answer, but to address the memory issue, you might want to increase the RAM available to R using memory.limit(8000) (8000 MB). Note that memory.limit() is Windows-only, and it is no longer supported as of R 4.2.0.

Upvotes: 2

IRTFM

Reputation: 263481

You made 2 sublists but did not actually extract a dataframe or a vector.

Correlation<- cor(A[[1]][[1]], B[[1]][[1]])

The expression A[[1]] returns the first dataframe (if in fact the object was as you described it), and then the additional [[1]] returns the first column as an atomic vector so that it matches the requirements of the cor function. It's a bit unclear what you mean by either "the correlation for the entire data frame" or "faster way which could be implemented to the entire list." You could use lapply() or a for-loop to iterate over either the list of dataframes or over the columns of the dataframes. Why not make a list of 2 or 3 dataframes of more modest size, so that someone can show you how to do one or both of those methods? Or you could read some introductory material such as "An Introduction to R".
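As a rough illustration with made-up data of modest size (the names `A_demo` and `B_demo` are hypothetical), the sketch below shows the difference between `[` and `[[` and then iterates both ways: once over the list of dataframes using only the first column of each, and once over every column of each pair.

```r
# Two short lists of small data frames (illustrative only).
set.seed(7)
A_demo <- lapply(1:3, function(i) data.frame(x = rnorm(6), y = rnorm(6)))
B_demo <- lapply(1:3, function(i) data.frame(x = rnorm(6), y = rnorm(6)))

# Single brackets keep the list wrapper; double brackets extract the element.
sub_is_df  <- is.data.frame(A_demo[1])     # a one-element list, not a data frame
elem_is_df <- is.data.frame(A_demo[[1]])   # the data frame itself

# Iterate over the list: correlate the first column of each pair.
first_col_cor <- vapply(
  seq_along(A_demo),
  function(i) cor(A_demo[[i]][[1]], B_demo[[i]][[1]]),
  numeric(1)
)

# Iterate over the columns as well: one coefficient per column per pair.
col_cor <- lapply(
  seq_along(A_demo),
  function(i) mapply(cor, A_demo[[i]], B_demo[[i]])
)
```

Here `first_col_cor` holds one value per pair, while `col_cor` is a list whose elements are named by column.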

Upvotes: 4
