Philippe

Reputation: 105

Optimizing vector indexing and matrix creation in R

First of all, I am sorry if this topic has already been discussed elsewhere, but I could not find anything relevant when searching.

My problem is the following: I have 4 named vectors with partially overlapping names, and I want to organize all of this data into a matrix. The final matrix should have a row for every name present in at least one of the input vectors. I used the following code:

IDs <- unique(c(names(v1), names(v2), names(v3), names(v4))) 
mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]), nrow=length(IDs), ncol=4)
mat[is.na(mat)] <- 0 
# This last line is to convert NAs generated when the entry isn't present in all vectors into 0 values. 

It works well but, with more than 2.2 million entries in total, it is extremely slow (it took 2.5 days to run...). I am thus looking for a way to speed up the process.

I tried to use other structures (e.g. to create a data frame instead of a matrix) but without great improvement. After some tests, it seems that the bottleneck is the following step (even when considered individually):

v1[IDs]

which is repeated for each of the vectors (1 to 4). Note that typically only ~50% of the names overlap between two vectors (and therefore that only 50% of IDs/names used for indexing are initially present in the names of the vector).

I monitored CPU and memory usage during the process, and it does not seem to be a memory issue (the 6 GB of free memory remained free throughout).

I would appreciate any hint on how to make this process faster. As I have to repeat this process several times, I can't really afford to wait 2 days each time I have to generate such an object.

Thanks. =)

Philippe.

Upvotes: 4

Views: 176

Answers (1)

Jean-Robert

Reputation: 840

If you use the reshape2 package, the dcast function can do the job. First stack your vectors in a data.frame:

df <- rbind(data.frame(IDs=names(v1), value=v1, vec=1),
            data.frame(IDs=names(v2), value=v2, vec=2),
            data.frame(IDs=names(v3), value=v3, vec=3),
            data.frame(IDs=names(v4), value=v4, vec=4))

Then cast this to wide format, with one row per ID and one column per vector:

library(reshape2)
dcast(df, IDs ~ vec, value.var="value")  # the column is named IDs (case-sensitive)

This outputs a data.frame, but you can easily convert it back to a matrix.
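As a minimal sketch of the whole pipeline, including the conversion back to a matrix, something like the following should work. The toy vectors are made up for illustration, and fill=0 is my assumption about how to get the zeros asked for in the question instead of NAs; the rest follows the steps above.

```r
library(reshape2)

# Hypothetical toy inputs: named numeric vectors with partially overlapping names.
v1 <- c(a = 1, b = 2)
v2 <- c(b = 3, c = 4)
v3 <- c(a = 5, d = 6)
v4 <- c(c = 7, d = 8)

# Stack the vectors in long format: one row per (ID, vector) pair.
df <- rbind(data.frame(IDs = names(v1), value = v1, vec = 1),
            data.frame(IDs = names(v2), value = v2, vec = 2),
            data.frame(IDs = names(v3), value = v3, vec = 3),
            data.frame(IDs = names(v4), value = v4, vec = 4))

# Cast to wide format; fill = 0 puts 0 where an ID is absent from a vector.
wide <- dcast(df, IDs ~ vec, value.var = "value", fill = 0)

# Drop the ID column, convert the remaining columns to a numeric matrix,
# and keep the IDs as row names.
mat <- as.matrix(wide[, -1])
rownames(mat) <- as.character(wide$IDs)
```

Dropping the IDs column before calling as.matrix matters: with the character column included, the whole matrix would be coerced to character.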

The speedup seems to increase as N grows: 5x faster with N=5000, 30x faster with N=10000, 67x faster with N=50000 where N is the length of v1.
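If you want to check this on your own machine, a rough timing harness could look like the sketch below. The generator make_vec, the seed, and N are hypothetical choices of mine (names are drawn from a pool of 2N IDs so that roughly 50% overlap between any two vectors, as described in the question); the actual timings will depend on your R version and hardware.

```r
library(reshape2)

set.seed(1)
N <- 5000  # hypothetical length of each vector

# Hypothetical generator: n values named by n distinct IDs sampled from a pool of 2n.
make_vec <- function(n) {
  ids <- sample(paste0("id", seq_len(2 * n)), n)
  setNames(runif(n), ids)
}
v1 <- make_vec(N); v2 <- make_vec(N); v3 <- make_vec(N); v4 <- make_vec(N)

# Approach from the question: index each vector by the union of names.
index_approach <- function() {
  IDs <- unique(c(names(v1), names(v2), names(v3), names(v4)))
  mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]),
                nrow = length(IDs), ncol = 4)
  mat[is.na(mat)] <- 0
  mat
}

# Approach from this answer: stack in long format, then cast to wide.
dcast_approach <- function() {
  df <- rbind(data.frame(IDs = names(v1), value = v1, vec = 1),
              data.frame(IDs = names(v2), value = v2, vec = 2),
              data.frame(IDs = names(v3), value = v3, vec = 3),
              data.frame(IDs = names(v4), value = v4, vec = 4))
  dcast(df, IDs ~ vec, value.var = "value", fill = 0)
}

# Compare elapsed times; rerun with larger N to see how the gap evolves.
system.time(index_approach())
system.time(dcast_approach())
```

Note that the two results are ordered differently: the first keeps the IDs in order of first appearance, while dcast sorts them.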

Upvotes: 2
