Reputation: 323
I need to compute pairwise intersections of lists, for close to 40k lists. Specifically, I want to know whether I can store a vector ID in column 1 and a list of its values in column 2. I should then be able to process column 2, i.e. find the overlap/intersection between any two rows.
column 1 column 2
idA 1,2,5,9,10
idB 5,9,25
idC 2,25,67
I want to get the values in each pairwise intersection, and this should also work when the values in column 2 are not already sorted.
What is the best data structure to use if I go ahead with R? My data originally looks like this:
column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0
Edited to add clarity, as per the suggestions below.
Upvotes: 0
Views: 68
Reputation: 132576
I'd keep the data in a logical matrix:
DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0", header = TRUE, check.names = FALSE)
#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"
#if you can, create your data as a sparse matrix to save memory
#if you already have a dense data matrix, keep it that way
library(Matrix)
M <- as(m, "lMatrix")
#calculate intersections
#does each comparison twice
intersections <- simplify2array(
  lapply(seq_len(nrow(M)), function(x)
    lapply(seq_len(nrow(M)), function(x, y) colnames(M)[M[x,] & (M[x,] == M[y,])], x = x)
  )
)
This double loop could be optimized. I'd do it in Rcpp and create a long-format data.frame instead of a list matrix. I'd also do each comparison only once (e.g., only the upper triangle).
colnames(intersections) <- rownames(intersections) <- rownames(M)
#    idA         idB         idC
#idA Character,5 Character,2 "2"
#idB Character,2 Character,3 "25"
#idC "2"         "25"        Character,3
intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
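The "each comparison only once" and "long format" ideas can be sketched in base R before reaching for Rcpp. This is only an illustration, not the Rcpp version: it rebuilds the example data as a plain logical matrix (no Matrix package), and the names pairs, res, long and the columns id1, id2, value are my own choices, not from the code above.

```r
# Same example data as above, as a plain logical matrix (base R only).
m <- matrix(
  c(1, 1, 0, 1, 1, 0, 0, 1,
    0, 0, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 0, 0, 1, 1, 0) == 1,
  nrow = 3, byrow = TRUE,
  dimnames = list(c("idA", "idB", "idC"),
                  c("1", "2", "3", "9", "10", "25", "67", "5"))
)

# All unordered pairs of row indices, one column per pair.
# This is the upper triangle, so each comparison is done once.
pairs <- combn(nrow(m), 2)

# For each pair, keep the column names where both rows are TRUE
# and store them as rows of a small data.frame.
res <- lapply(seq_len(ncol(pairs)), function(k) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  common <- colnames(m)[m[i, ] & m[j, ]]
  data.frame(id1 = rep(rownames(m)[i], length(common)),
             id2 = rep(rownames(m)[j], length(common)),
             value = common,
             stringsAsFactors = FALSE)
})

# Stack everything into one long-format data.frame:
# one row per (id1, id2, shared value).
long <- do.call(rbind, res)
long
```

Pairs with an empty intersection simply contribute no rows. Note that for roughly 40k ids there are about 800 million pairs, so materializing combn() for all of them will most likely not fit in memory; you would have to process the pairs in chunks, or move the loop to Rcpp as suggested above.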
Upvotes: 1