I have a data.table, DT, as follows:
DT <- fread("
ID Sentence_1 Sentence_2 iso3c year
1 This_is_an_example_sentence This_is_another_example_sentence ARG 1983
2 The_dog_walks_in_the_park This_is_another_example_sentence ARG 1983
5 The_dog_walks_in_the_park A_frisby_is_thrown_in_the_park NLD 1984
6 I_like_soup A_frisby_is_thrown_in_the_park NLD 1984",
header=TRUE)
DT$Sentence_1 <- gsub("_", " ", DT$Sentence_1)
DT$Sentence_2 <- gsub("_", " ", DT$Sentence_2)
For each word in Sentence_1, I would like to check whether that word also exists in Sentence_2. The number of matching words should be stored in a separate column.
DESIRED OUTPUT:
DT <- fread("
ID Sentence_1 Sentence_2 iso3c year matching_score
1 This_is_an_example_sentence This_is_another_example_sentence ARG 1983 4
2 The_dog_walks_in_the_park This_is_another_example_sentence ARG 1983 0
5 The_dog_walks_in_the_park A_frisby_is_thrown_in_the_park NLD 1984 3
6 I_like_soup A_frisby_is_thrown_in_the_park NLD 1984 0",
header=TRUE)
What would be the most efficient way of doing this?
Upvotes: 0
Views: 45
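# Split on "_" (assumes the gsub() step has not been run), then count matches per row
DT[, `:=`(s1l = strsplit(Sentence_1, "_"), s2l = strsplit(Sentence_2, "_"))]
DT[, matching_score := sum(s1l[[1]] %in% s2l[[1]]), by = ID]
# Remove the helper list columns by reference so DT prints as shown below
DT[, c("s1l", "s2l") := NULL]
DT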
ID Sentence_1 Sentence_2 iso3c year matching_score
1: 1 This_is_an_example_sentence This_is_another_example_sentence ARG 1983 4
2: 2 The_dog_walks_in_the_park This_is_another_example_sentence ARG 1983 0
3: 5 The_dog_walks_in_the_park A_frisby_is_thrown_in_the_park NLD 1984 3
4: 6 I_like_soup A_frisby_is_thrown_in_the_park NLD 1984 0
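If the underscores have already been replaced by spaces, as in the question, the same count can be computed without helper columns. The following is only a sketch under that assumption: `count_matches` is a hypothetical helper name, the table is rebuilt by hand with just the relevant columns, and matching is case-sensitive (so "The" and "the" are treated as different words, which is what gives 3 for ID 5).

```r
library(data.table)

# Example data with spaces instead of underscores (only the columns needed here)
DT <- data.table(
  ID = c(1L, 2L, 5L, 6L),
  Sentence_1 = c("This is an example sentence",
                 "The dog walks in the park",
                 "The dog walks in the park",
                 "I like soup"),
  Sentence_2 = c("This is another example sentence",
                 "This is another example sentence",
                 "A frisby is thrown in the park",
                 "A frisby is thrown in the park")
)

# Hypothetical helper: number of words in `a` that also occur in `b`
count_matches <- function(a, b) {
  words_a <- strsplit(a, " ", fixed = TRUE)[[1]]
  words_b <- strsplit(b, " ", fixed = TRUE)[[1]]
  sum(words_a %in% words_b)
}

# Apply the helper pairwise over the two columns; no grouping by ID is needed
DT[, matching_score := mapply(count_matches, Sentence_1, Sentence_2,
                              USE.NAMES = FALSE)]
DT
```

Because `mapply` walks the two columns in parallel, this does not rely on ID being unique, unlike the `by = ID` approach.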
Upvotes: 3