Reputation: 11657
Let's say I have the following data.frame
test_df = read.table(text = "id, chat
11, hey how are you im doing well
12, hey how are you im
13, my name is bob
14, my name is bob. how are you?
15, test test
16, test testing
17, who is this
18, who is this. who are you?", header = T, sep = ",")
I'm trying to identify which pairs of rows, for the column chat
have the most similar values via string distance. I then want to create a column partner_id
which identifies the id of the person with the most similar value of chat
.
Here's the desired output:
final_df = read.table(text = "id, chat, partner_id
11, hey how are you im doing well, 12
12, hey how are you im, 11
13, my name is bob, 14
14, my name is bob. how are you?, 13
15, test test, 16
16, test testing, 15
17, who is this, 18
18, who is this. who are you?, 17", header = T, sep = ",")
I know that the stringdist
library has various fuzzy matching calculations, but I can't think of a better solution than just iterating over all the rows, computing every pairwise string distance, and then storing the best solution (which becomes computationally challenging with big data.frames). It's particularly challenging since every id can only have exactly one partner_id, so once a "best" match has been selected, that id
should no longer be available for matching.
Upvotes: 2
Views: 329
Reputation: 8484
Since you want to match each chat with all the others, the complexity of the algorithm will obviously be high.
However, you can remove the id of the chats that already have a match from the competitors, so that each step take a little shorter than the previous one.
As much as I hate for
loops in R, I couldn't find a purrr
solution so here we go:
library(tidyverse)
test_df = read.table(text = "id, chat, partner_id_orig
11, hey how are you im doing well, 12
12, hey how are you im, 11
13, my name is bob, 14
14, my name is bob. how are you?, 13
15, test test, 16
16, test testing, 15
17, who is this, 18
18, who is this. who are you?, 17", header = T, sep = ",")
method="qgram" #qgram, cosine, jw, and sounded matches your expected output
test_df$partner_id=NA
for(i in seq_len(nrow(test_df))){
c_chat = test_df$chat[i]
c_id = test_df$id[i]
competitors_id = test_df$id[!test_df$id %in% c(test_df$partner_id, c_id)]
competitors = test_df$chat[test_df$id %in% competitors_id]
print(length(competitors))
dist = stringdist::stringdist(competitors, c_chat, method=method)
best_id = competitors_id[which.min(dist)]
test_df$partner_id[i] = best_id
}
#> [1] 7
#> [1] 7
#> [1] 5
#> [1] 5
#> [1] 3
#> [1] 3
#> [1] 1
#> [1] 1
test_df
#> id chat partner_id_orig partner_id
#> 1 11 hey how are you im doing well 12 12
#> 2 12 hey how are you im 11 11
#> 3 13 my name is bob 14 14
#> 4 14 my name is bob. how are you? 13 13
#> 5 15 test test 16 16
#> 6 16 test testing 15 15
#> 7 17 who is this 18 18
#> 8 18 who is this. who are you? 17 17
Created on 2021-10-01 by the reprex package (v2.0.0)
As you can see, length(competitors)
decreases with each step so it should not take as long as computing every pairwise string distance. However, it might still be computationally challenging with big data.frames
.
Upvotes: 1