Reputation: 6222
My aim is to find row indices of a matrix (dat
) that contain matching rows of another matrix (xy
).
I find it easy to do this with smaller matrices, as shown in the examples. But the matrices I have a very large number of rows.
For toy example the matrices dat
and xy
are given below. The aim is to recover the indices 14, 58, 99. In my case, both these matrices have a very larger number of rows.
# toy data
dat <- iris
dat$Sepal.Length <- dat$Sepal.Length * (1 + runif(150))
xy <- dat[c(14, 58, 99), c(1, 5)]
For small matrices, the solutions would be
# solution 1
ind <- NULL
for(j in 1 : length(x)) {
ind[j] <- which((dat$Sepal.Length ==xy[j, 1]) & (dat$Species == xy[j, 2]))
}
Or
# solution 2
which(outer(dat$Sepal.Length, xy[, 1], "==") &
outer(dat$Species, xy[, 2], "=="), arr.ind=TRUE)
But given the size of my data, these methods are not feasible. The first method takes a lot of time and the other fails due to lack of memory.
I wish I know more data.table
and dplyr
.
Upvotes: 2
Views: 564
Reputation: 66819
With data.table, it's a join:
library(data.table)
setDT(dat); setDT(xy)
dat[xy, on=names(xy), which=TRUE]
# [1] 14 58 99
Upvotes: 4
Reputation: 66
if your data is huge, You can hash your rows firstly (for both matrix),then match the row hash values,using digest package.
target_matrix<-iris
query_matrix<-iris[c(14, 58, 99),]
target_row_hash<-apply(target_matrix,1,digest)
query_row_hash<-apply(query_matrix,1,digest)
row_nums<-match(query_row_hash,target_row_hash)
row_nums
14 58 99
Upvotes: 1
Reputation: 1263
For the setup you provided, you could use:
library(tidyverse)
dat %>%
mutate(row_num = row_number()) %>%
inner_join(xy, by = c("Sepal.Length", "Species")) %>%
pull(row_num)
This adds a column for the initial row number, does an inner join to produce a data frame with rows in dat that match rows from xy, and then pulls the indices. (An inner join will return all rows from dat that match rows from xy, while a semi-join will return only one row from dat for each row in xy.)
It's worth noting that in this example we are dealing with data frames, not matrices:
> class(xy)
[1] "data.frame"
> class(dat)
[1] "data.frame"
The above code won't work if the data is in matrix form - can you convert your matrices to data frames or tibbles?
Upvotes: 1
Reputation: 120
Following chinsoon12's suggestion, try this:
library(dplyr)
dat$rowind <- 1:nrow(dat) # adds row index if wanted (not necessary though)
newDf <- semi_join(dat, xy, by = c("Species", "Sepal.Length"))
Upvotes: 1
Reputation: 5398
I used paste0() to concatenate Sepal.Length and Species into a temporary variable.
Then match() to return the index of the matches between the two temporary variables.
Then not, '!', is.na() to remove the non-matches and convert to a logical vector.
Then return which() indices are true.
which(!is.na(match(paste0(dat$Sepal.Length, dat$Species), paste0(xy$Sepal.Length, xy$Species))))
[1] 14 58 99
PS: merge() accepts combined variables in by.x and by.y:
merge(dat, xy, by.x=c("Sepal.Length", "Species"), by.y=c("Sepal.Length", "Species"), all.x=FALSE, all.y=TRUE)
Upvotes: 1
Reputation: 416
You could try this dplyr
solution. Depends on how big your data frames are.
#use dplyr filter
library(dplyr)
dat %>%
mutate(row_no = row_number()) %>%
filter(Sepal.Length %in% xy$Sepal.Length & Species %in% xy$Species) %>%
select(row_no)
#> row_no
#> 1 14
#> 2 58
#> 3 99
Upvotes: 1