Gerald T
Gerald T

Reputation: 734

Multiple matches in different row and columns for large data

I have some problems with multiple matches in two different tables (1 million rows * 15; 3000 * 20) which potentially may get much larger (10 million rows).

My solutions work but I would like to make the script as fast as possible considering that I may have to use it with larger data frames. I am using the r package data.table.

Consider two example tables where no row can be deleted:

table 1 - The ToMach column equal to FALSE means that the associated tag is not present in table2, this step reduced by two order of magnitude the matches to be performed:

set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_",1:3), 5, replace = T))
table1[ , ToMatch := ifelse(Tag == "tag_1", F, T)]

table1
     Tag ToMatch
1: tag_2    TRUE
2: tag_1   FALSE
3: tag_3    TRUE
4: tag_3    TRUE
5: tag_2    TRUE

table2:

set.seed(99)
table2 <- data.table(center = sample(paste0("tag_",2:8), 5, replace = T),
                 north  = sample(paste0("tag_",2:8), 5, replace = T),
                 south  = sample(paste0("tag_",2:8), 5, replace = T))

> table2
   center north south
1:  tag_6 tag_8 tag_5
2:  tag_2 tag_6 tag_5
3:  tag_6 tag_4 tag_3
4:  tag_8 tag_4 tag_6
5:  tag_5 tag_3 tag_6

My objective is to find the rows of table2 where the tags of table1 are found (can be in any column of the above columns). I was thinking of the output as a list:

Output:

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

My solution:

What rows of table 1 are to evaluate

match.index <- which(table1$ToMatch == T)
> match.index
[1] 1 3 4 5

Pool all tags from table 2. Maintain the row order using t (tag_6 tag_8 tag_5 tag_2 tag_6 tag_5 ...)

all.tags <- as.vector(t(table2))
> all.tags
 [1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6"
 [8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3"
[15] "tag_6"

Predefine an empty list

list.results <- as.list(rep(as.numeric(NA), dim(table1)[1]))

The loop:

for (i in 1:length(match.index)) {

    list.results[[ match.index[i] ]] <- ceiling(

        grep(table1[match.index[i], Tag], all.tags) 

        /3) 
}

# dividing the index of all.tags found with grep by 3 (the original
# number of columns in table2) and rounding up to the closest integer 
# (ceiling) return the index of the original table 2 where the tag 
# is located

Final output:

> table1[ , output := list.results]
> table1
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

Do you have any suggestion to speed up this code?

Thank you in advance

Upvotes: 0

Views: 151

Answers (2)

eddi
eddi

Reputation: 49448

The difficulty is mainly in the wide representation of table2. Once that's been melted down, the rest is easy:

melt(table2[, id := .I], id = 'id')[
  table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI]
#   value   V1
#1: tag_2    2
#2: tag_1 NULL
#3: tag_3  5,3
#4: tag_3  5,3
#5: tag_2    2

And if you have a lot of duplicates - unique your data beforehand:

melt(table2[, id := .I], id = 'id')[
  unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][
  table1, on = c(value = 'Tag')]

Upvotes: 1

lmo
lmo

Reputation: 38500

Here's a bit of base R code that will do the trick:

table1 <- within(table1, {
                 output <- NA
                 output[ToMatch] <- sapply(Tag[ToMatch], function(x) 
                                   paste(which(x == table2, arr.ind=TRUE)[,1], collapse=","))
})

which returns

table1

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    5,3
4: tag_3    TRUE    5,3
5: tag_2    TRUE      2

Here is a brief description. within allows reference within an object (often a data frame) and reduces the need for typing a bit. First, assign output NA. Then, for every element of output that you want to match (using ToMatch), use which to with the arr.ind=TRUE argument find the rows of the elements that match each. paste the results for each element together, collapsing on ",".


A data.table analog of the above code is

table1[, output := NA_character_][as.logical(ToMatch), 
       output := sapply(Tag, function(x) paste(which(x == table2, arr.ind=TRUE)[,1],
                                               collapse=","))][]
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    5,3
4: tag_3    TRUE    5,3
5: tag_2    TRUE      2

The first [] creates the NA vector and the second subsets to the elements of interest and fills in the NA vector with the desired values. This "filling in" section of the code is identical to the code above.

Upvotes: 1

Related Questions