Reputation: 105
First of all, I am sorry if this topic has already been discussed elsewhere, but I could not find anything relevant when searching.
My problem is the following: I have 4 named vectors with partially overlapping names and I want to organize all of these data into a matrix. The final matrix should have a row for every name present in at least one of the input vectors. I used the following code:
IDs <- unique(c(names(v1), names(v2), names(v3), names(v4)))
mat <- matrix(c(v1[IDs], v2[IDs], v3[IDs], v4[IDs]), nrow=length(IDs), ncol=4)
mat[is.na(mat)] <- 0
# This last line is to convert NAs generated when the entry isn't present in all vectors into 0 values.
It works well but, as I have more than 2.2 million entries in total, it is extremely slow (it took 2.5 days to run...). I am therefore looking for a way to speed up the process.
I tried other structures (e.g. creating a data frame instead of a matrix) but without much improvement. After some tests, it seems that the bottleneck is the following step (even when run on its own):
v1[IDs]
which is repeated for each of the vectors (1 to 4). Note that typically only ~50% of the names overlap between two vectors (so only about 50% of the IDs used for indexing are actually present in the names of a given vector).
I monitored CPU and memory usage during the process and it does not seem to be a memory issue (the 6 GB of free memory remained free throughout).
I would appreciate any hint on how to make this process faster. As I have to repeat this process several times, I can't really afford to wait 2 days each time I have to generate such an object.
Thanks. =)
Philippe.
Upvotes: 4
Views: 176
Reputation: 840
If you use the reshape2 package, the dcast function can do the job.
First stack your vectors in a data.frame:
df <- rbind(data.frame(IDs=names(v1), value=v1, vec=1),
data.frame(IDs=names(v2), value=v2, vec=2),
data.frame(IDs=names(v3), value=v3, vec=3),
data.frame(IDs=names(v4), value=v4, vec=4))
Then transform this into a wide format (the formula must use the column name IDs exactly as defined above):
library(reshape2)
dcast(df, IDs ~ vec, value.var="value")
This outputs a data.frame, but you can easily convert it back to a matrix.
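A minimal sketch of that conversion, using base R only. The object wide here is a small hand-built stand-in for the dcast output (an assumption for illustration, with the ID column first and one column per vector); the NA-to-0 step mirrors what the question does.

```r
# Hypothetical stand-in for the data.frame returned by dcast():
# first column holds the IDs, remaining columns hold one vector each.
wide <- data.frame(IDs = c("a", "b", "c"),
                   `1` = c(1, 2, NA),
                   `2` = c(NA, 3, 4),
                   check.names = FALSE)

# Drop the ID column, keep the rest as a numeric matrix.
mat <- as.matrix(wide[, -1, drop = FALSE])

# Carry the IDs over as row names.
rownames(mat) <- as.character(wide$IDs)

# Entries missing from a given vector come out as NA; set them to 0.
mat[is.na(mat)] <- 0
```

Using drop = FALSE keeps the result a matrix even if there is only one value column.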
The speedup seems to increase as N grows: 5x faster with N=5000, 30x faster with N=10000, 67x faster with N=50000, where N is the length of v1.
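For comparison, here is a base-R sketch that avoids indexing by name altogether. It is a different technique from the dcast approach above: it pre-allocates a zero matrix and uses match() to find the row of each name. The toy vectors are made up, names are assumed to be unique within each vector, and it has not been timed on data of the size described in the question.

```r
# Hypothetical toy inputs standing in for v1..v4.
v1 <- c(a = 1, b = 2)
v2 <- c(b = 3, c = 4)
vecs <- list(v1, v2)

# Union of all names, in order of first appearance.
IDs <- unique(unlist(lapply(vecs, names)))

# Pre-fill with 0 so absent entries need no NA replacement afterwards.
mat <- matrix(0, nrow = length(IDs), ncol = length(vecs),
              dimnames = list(IDs, NULL))

for (j in seq_along(vecs)) {
  # Row position of each name of vector j within IDs.
  pos <- match(names(vecs[[j]]), IDs)
  mat[pos, j] <- vecs[[j]]
}
```

Because every name of every vector is in IDs by construction, match() never returns NA here.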
Upvotes: 2