Reputation: 734
I have some problems with multiple matches in two different tables (1 million rows * 15; 3000 * 20) which potentially may get much larger (10 million rows).
My solutions work but I would like to make the script as fast as possible considering that I may have to use it with larger data frames. I am using the r package data.table.
Consider two example tables where no row can be deleted:
table 1 - The ToMach column equal to FALSE means that the associated tag is not present in table2, this step reduced by two order of magnitude the matches to be performed:
set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_",1:3), 5, replace = T))
table1[ , ToMatch := ifelse(Tag == "tag_1", F, T)]
table1
Tag ToMatch
1: tag_2 TRUE
2: tag_1 FALSE
3: tag_3 TRUE
4: tag_3 TRUE
5: tag_2 TRUE
table2:
set.seed(99)
table2 <- data.table(center = sample(paste0("tag_",2:8), 5, replace = T),
north = sample(paste0("tag_",2:8), 5, replace = T),
south = sample(paste0("tag_",2:8), 5, replace = T))
> table2
center north south
1: tag_6 tag_8 tag_5
2: tag_2 tag_6 tag_5
3: tag_6 tag_4 tag_3
4: tag_8 tag_4 tag_6
5: tag_5 tag_3 tag_6
My objective is to find the rows of table2 where the tags of table1 are found (can be in any column of the above columns). I was thinking of the output as a list:
Output:
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 3,5
4: tag_3 TRUE 3,5
5: tag_2 TRUE 2
My solution:
What rows of table 1 are to evaluate
match.index <- which(table1$ToMatch == T)
> match.index
[1] 1 3 4 5
Pool all tags from table 2. Maintain the row order using t
(tag_6 tag_8 tag_5 tag_2 tag_6 tag_5 ...
)
all.tags <- as.vector(t(table2))
> all.tags
[1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6"
[8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3"
[15] "tag_6"
Predefine an empty list
list.results <- as.list(rep(as.numeric(NA), dim(table1)[1]))
The loop:
for (i in 1:length(match.index)) {
list.results[[ match.index[i] ]] <- ceiling(
grep(table1[match.index[i], Tag], all.tags)
/3)
}
# dividing the index of all.tags found with grep by 3 (the original
# number of columns in table2) and rounding up to the closest integer
# (ceiling) return the index of the original table 2 where the tag
# is located
Final output:
> table1[ , output := list.results]
> table1
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 3,5
4: tag_3 TRUE 3,5
5: tag_2 TRUE 2
Do you have any suggestion to speed up this code?
Thank you in advance
Upvotes: 0
Views: 151
Reputation: 49448
The difficulty is mainly in the wide representation of table2
. Once that's been melted down, the rest is easy:
melt(table2[, id := .I], id = 'id')[
table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI]
# value V1
#1: tag_2 2
#2: tag_1 NULL
#3: tag_3 5,3
#4: tag_3 5,3
#5: tag_2 2
And if you have a lot of duplicates - unique your data beforehand:
melt(table2[, id := .I], id = 'id')[
unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][
table1, on = c(value = 'Tag')]
Upvotes: 1
Reputation: 38500
Here's a bit of base R code that will do the trick:
table1 <- within(table1, {
output <- NA
output[ToMatch] <- sapply(Tag[ToMatch], function(x)
paste(which(x == table2, arr.ind=TRUE)[,1], collapse=","))
})
which returns
table1
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 5,3
4: tag_3 TRUE 5,3
5: tag_2 TRUE 2
Here is a brief description. within
allows reference within an object (often a data frame) and reduces the need for typing a bit. First, assign output NA. Then, for every element of output that you want to match (using ToMatch), use which
to with the arr.ind=TRUE argument find the rows of the elements that match each. paste
the results for each element together, collapsing on ",".
A data.table
analog of the above code is
table1[, output := NA_character_][as.logical(ToMatch),
output := sapply(Tag, function(x) paste(which(x == table2, arr.ind=TRUE)[,1],
collapse=","))][]
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 5,3
4: tag_3 TRUE 5,3
5: tag_2 TRUE 2
The first []
creates the NA vector and the second subsets to the elements of interest and fills in the NA vector with the desired values. This "filling in" section of the code is identical to the code above.
Upvotes: 1