kangaroo_cliff
kangaroo_cliff

Reputation: 6222

Finding rows of a large matrix that match specific values

My aim is to find row indices of a matrix (dat) that contain matching rows of another matrix (xy).

I find it easy to do this with smaller matrices, as shown in the examples. But the matrices I have a very large number of rows.

For toy example the matrices dat and xy are given below. The aim is to recover the indices 14, 58, 99. In my case, both these matrices have a very larger number of rows.

# toy data
dat <- iris
dat$Sepal.Length <- dat$Sepal.Length * (1 + runif(150))

xy <- dat[c(14, 58, 99), c(1, 5)]

For small matrices, the solutions would be

# solution 1
ind <- NULL
for(j in 1 : length(x)) {

  ind[j] <- which((dat$Sepal.Length ==xy[j, 1]) & (dat$Species == xy[j, 2]))
}

Or

# solution 2
which(outer(dat$Sepal.Length, xy[, 1], "==") & 
        outer(dat$Species, xy[, 2], "=="), arr.ind=TRUE)

But given the size of my data, these methods are not feasible. The first method takes a lot of time and the other fails due to lack of memory.

I wish I know more data.table and dplyr.

Upvotes: 2

Views: 564

Answers (6)

Frank
Frank

Reputation: 66819

With data.table, it's a join:

library(data.table)
setDT(dat); setDT(xy)

dat[xy, on=names(xy), which=TRUE]
# [1] 14 58 99

Upvotes: 4

harryjerry
harryjerry

Reputation: 66

if your data is huge, You can hash your rows firstly (for both matrix),then match the row hash values,using digest package.

target_matrix<-iris
query_matrix<-iris[c(14, 58, 99),]
target_row_hash<-apply(target_matrix,1,digest)
query_row_hash<-apply(query_matrix,1,digest)
row_nums<-match(query_row_hash,target_row_hash)
row_nums

output:

14 58 99

Upvotes: 1

cardinal40
cardinal40

Reputation: 1263

For the setup you provided, you could use:

library(tidyverse)
dat %>% 
  mutate(row_num = row_number()) %>% 
  inner_join(xy, by = c("Sepal.Length", "Species")) %>%
  pull(row_num)

This adds a column for the initial row number, does an inner join to produce a data frame with rows in dat that match rows from xy, and then pulls the indices. (An inner join will return all rows from dat that match rows from xy, while a semi-join will return only one row from dat for each row in xy.)

It's worth noting that in this example we are dealing with data frames, not matrices:

> class(xy)
[1] "data.frame"
> class(dat)
[1] "data.frame"

The above code won't work if the data is in matrix form - can you convert your matrices to data frames or tibbles?

Upvotes: 1

Brindle Cruncher
Brindle Cruncher

Reputation: 120

Following chinsoon12's suggestion, try this:

library(dplyr)
dat$rowind <- 1:nrow(dat) # adds row index if wanted (not necessary though)
newDf <- semi_join(dat, xy, by = c("Species", "Sepal.Length"))

Upvotes: 1

M.Viking
M.Viking

Reputation: 5398

I used paste0() to concatenate Sepal.Length and Species into a temporary variable.

Then match() to return the index of the matches between the two temporary variables.

Then not, '!', is.na() to remove the non-matches and convert to a logical vector.

Then return which() indices are true.

which(!is.na(match(paste0(dat$Sepal.Length, dat$Species), paste0(xy$Sepal.Length, xy$Species))))

[1] 14 58 99

PS: merge() accepts combined variables in by.x and by.y:

merge(dat, xy, by.x=c("Sepal.Length", "Species"), by.y=c("Sepal.Length", "Species"), all.x=FALSE, all.y=TRUE)

Upvotes: 1

Suhas Hegde
Suhas Hegde

Reputation: 416

You could try this dplyr solution. Depends on how big your data frames are.

#use dplyr filter
library(dplyr)

dat %>%
  mutate(row_no = row_number()) %>%
  filter(Sepal.Length %in% xy$Sepal.Length & Species %in% xy$Species) %>%
  select(row_no)
#>   row_no
#> 1     14
#> 2     58
#> 3     99

Upvotes: 1

Related Questions