Reputation: 6409
I have a huge data frame:
library(gtools)
a<-permutations(2,20,v=c(0,1),repeats.allowed=TRUE)
a<-as.data.frame(a)
And I have 100 random strings:
set.seed(123)
b<-replicate(100,sample(c(0,1),20, replace=T))
I would like to identify the row numbers in in 'a'
that corresponds to each column in 'b'
.
Since 'a'
is huge this process takes quite some time.
right now I am using the following method:
sapply(1:100, function(x) which(colSums(t(a)==as.numeric(b[,x]))==20L))
This process takes a lot of time. I was wondering if there is a more efficient way to do this?
Upvotes: 0
Views: 80
Reputation: 46866
Represent the columns as digits by thinking of them as bit strings, then use %in% for fast look-up
library(gtools)
a <- permutations(2,20,v=c(0,1),repeats.allowed=TRUE)
a <- as.data.frame(a)
set.seed(123)
b <- replicate(100, sample(c(0, 1), 20, replace=TRUE))
a1 <- colSums(t(a) * 2^(0:19))
b1 <- colSums(b * 2^(0:19))
which produces
> head(which(a1 %in% b1))
[1] 1191 9434 14502 19812 30619 34313
To deal with duplicates, consider this smaller example
b1 <- c(1, 3, 3, 5, 4)
a1 <- c(3, 4, 8)
Discover the unique b1 values, and create a list that maps from the unique values to the index in the original values
ub1 <- unique(b1)
umap <- unname(split(seq_along(b1), match(b1, ub1)))
Now match the a1 to the unique b1, decide which to keep (are not NA), and look up the matches in the unique map
m <- match(a1, ub1)
keep <- which(!is.na(m))
keepmap <- umap[m[keep]]
Finally, use keepmap to figure out how many times each kept value needs to be replicated (because it maps to multiple original values) and create a data.frame of the results
len <- sapply(keepmap, length)
data.frame(ai=rep(keep, len),
a1=rep(a1[keep], len),
b1=unlist(unname(keepmap)))
So a complete function is
matchrows <-
function(a, b)
{
## encode
a1 <- colSums(t(a) * 2^(0:19))
b1 <- colSums(b * 2^(0:19))
## match to unique values
ub1 <- unique(b1)
m <- match(a1, ub1)
keep <- which(!is.na(m))
## expand unique matches to original coordinates
umap <- unname(split(seq_along(b1), match(b1, ub1)))
keepmap <- umap[m[keep]]
len <- sapply(keepmap, length)
data.frame(ai=rep(keep, len),
bi=unlist(unname(keepmap)),
value=rep(a1[keep], len))
}
Upvotes: 2