Parseltongue
Parseltongue

Reputation: 11657

Identify which pairs of rows are most similar (string distance) in a data.frame

Let's say I have the following data.frame

test_df = read.table(text = "id, chat
11, hey how are you im doing well
12, hey how are you im
13, my name is bob
14, my name is bob. how are you?
15, test test
16, test testing
17, who is this
18, who is this. who are you?", header = T, sep = ",")

I'm trying to identify which pairs of rows, for the column chat have the most similar values via string distance. I then want to create a column partner_id which identifies the id of the person with the most similar value of chat.

Here's the desired output:

final_df = read.table(text = "id, chat, partner_id
11, hey how are you im doing well, 12
12, hey how are you im, 11
13, my name is bob, 14
14, my name is bob. how are you?, 13
15, test test, 16
16, test testing, 15
17, who is this, 18
18, who is this. who are you?, 17", header = T, sep = ",")

I know that the stringdist library has various fuzzy matching calculations, but I can't think of a better solution than just iterating over all the rows, computing every pairwise string distance, and then storing the best solution (which becomes computationally challenging with big data.frames). It's particularly challenging since every id can only have exactly one partner_id, so once a "best" match has been selected, that id should no longer be available for matching.

Upvotes: 2

Views: 329

Answers (1)

Dan Chaltiel
Dan Chaltiel

Reputation: 8484

Since you want to match each chat with all the others, the complexity of the algorithm will obviously be high.

However, you can remove the id of the chats that already have a match from the competitors, so that each step take a little shorter than the previous one.

As much as I hate for loops in R, I couldn't find a purrr solution so here we go:

library(tidyverse)
test_df = read.table(text = "id, chat, partner_id_orig
11, hey how are you im doing well, 12
12, hey how are you im, 11
13, my name is bob, 14
14, my name is bob. how are you?, 13
15, test test, 16
16, test testing, 15
17, who is this, 18
18, who is this. who are you?, 17", header = T, sep = ",")

method="qgram" #qgram, cosine, jw, and sounded matches your expected output
test_df$partner_id=NA
for(i in seq_len(nrow(test_df))){
    c_chat = test_df$chat[i]
    c_id = test_df$id[i]
    competitors_id = test_df$id[!test_df$id %in% c(test_df$partner_id, c_id)]
    competitors = test_df$chat[test_df$id %in% competitors_id]
    print(length(competitors))
    dist = stringdist::stringdist(competitors, c_chat, method=method)
    best_id = competitors_id[which.min(dist)]
    test_df$partner_id[i] = best_id
}
#> [1] 7
#> [1] 7
#> [1] 5
#> [1] 5
#> [1] 3
#> [1] 3
#> [1] 1
#> [1] 1
test_df
#>   id                           chat partner_id_orig partner_id
#> 1 11  hey how are you im doing well              12         12
#> 2 12             hey how are you im              11         11
#> 3 13                 my name is bob              14         14
#> 4 14   my name is bob. how are you?              13         13
#> 5 15                      test test              16         16
#> 6 16                   test testing              15         15
#> 7 17                    who is this              18         18
#> 8 18      who is this. who are you?              17         17

Created on 2021-10-01 by the reprex package (v2.0.0)

As you can see, length(competitors) decreases with each step so it should not take as long as computing every pairwise string distance. However, it might still be computationally challenging with big data.frames.

Upvotes: 1

Related Questions