Tom

Reputation: 2351

For each word in a string, check if it is part of another string of words

I have a data.table (DT) as follows:

library(data.table)
DT <- fread("
ID  Sentence_1                  Sentence_2                        iso3c   year
1   This_is_an_example_sentence This_is_another_example_sentence  ARG     1983
2   The_dog_walks_in_the_park   This_is_another_example_sentence  ARG     1983
5   The_dog_walks_in_the_park   A_frisby_is_thrown_in_the_park    NLD     1984
6   I_like_soup                 A_frisby_is_thrown_in_the_park    NLD     1984",
header=TRUE)
DT$Sentence_1 <- gsub("_", " ", DT$Sentence_1)
DT$Sentence_2 <- gsub("_", " ", DT$Sentence_2)

For each word in Sentence_1, I would like to check whether that word also appears in Sentence_2, and store the number of matching words in a separate column.

Desired output:

DT <- fread("
ID  Sentence_1                  Sentence_2                        iso3c   year  matching_score
1   This_is_an_example_sentence This_is_another_example_sentence  ARG     1983  4
2   The_dog_walks_in_the_park   This_is_another_example_sentence  ARG     1983  0
5   The_dog_walks_in_the_park   A_frisby_is_thrown_in_the_park    NLD     1984  3
6   I_like_soup                 A_frisby_is_thrown_in_the_park    NLD     1984  0",
header=TRUE)

What would be the most efficient way of doing this?

Upvotes: 0

Views: 45

Answers (1)

s_baldur

Reputation: 33613

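# Split on "_" (use " " instead if the underscores were already replaced by spaces)
DT[, `:=`(s1l = strsplit(Sentence_1, "_"), s2l = strsplit(Sentence_2, "_"))]
# Count the words of Sentence_1 that also occur in Sentence_2, then drop the helper columns by reference
DT[, matching_score := sum(s1l[[1]] %in% s2l[[1]]), by = ID][, c("s1l", "s2l") := NULL]
DT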



   ID                  Sentence_1                       Sentence_2 iso3c year matching_score
1:  1 This_is_an_example_sentence This_is_another_example_sentence   ARG 1983              4
2:  2   The_dog_walks_in_the_park This_is_another_example_sentence   ARG 1983              0
3:  5   The_dog_walks_in_the_park   A_frisby_is_thrown_in_the_park   NLD 1984              3
4:  6                 I_like_soup   A_frisby_is_thrown_in_the_park   NLD 1984              0
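
If the sentences are already space-separated (as they are after the gsub step in the question), the same count can probably be done without grouping by ID. The following is only a minimal sketch in base R: count_matching_words is a hypothetical helper name, and the sample vectors are copied from the question's data. Note that %in% is case-sensitive ("The" does not match "the") and that a word repeated in Sentence_1 is counted each time it appears.

```r
# Hypothetical helper (not from the original answer): for each pair of
# sentences, count the words of s1 that also occur in s2.
count_matching_words <- function(s1, s2, sep = " ") {
  words_1 <- strsplit(s1, sep, fixed = TRUE)
  words_2 <- strsplit(s2, sep, fixed = TRUE)
  # mapply walks both lists in parallel; sum() counts the TRUE matches
  mapply(function(a, b) sum(a %in% b), words_1, words_2, USE.NAMES = FALSE)
}

# Sample data taken from the question, with underscores already replaced
s1 <- c("This is an example sentence", "The dog walks in the park",
        "The dog walks in the park", "I like soup")
s2 <- c("This is another example sentence", "This is another example sentence",
        "A frisby is thrown in the park", "A frisby is thrown in the park")

scores <- count_matching_words(s1, s2)
print(scores)
```

With a data.table, the helper could be applied in one step as DT[, matching_score := count_matching_words(Sentence_1, Sentence_2)], which also avoids relying on ID being unique.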

Upvotes: 3
