Computing cosine similarity between specific strings using dplyr

Question

Let's say I have two datasets like this:

library(tidyverse)
df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10))

# A tibble: 10 × 3
      abel    abby    bret
     <dbl>   <dbl>   <dbl>
 1 -1.24   -0.498   0.663 
 2 -0.179  -1.49    0.619 
 3  1.36   -2.19    1.43  
 4  0.145   0.520   1.01  
 5 -0.873   1.21    0.0441
 6 -0.0434  0.103   0.201 
 7 -0.472  -0.389  -1.86  
 8 -0.486  -0.318   0.291 
 9  0.0884 -0.0458 -0.527 
10 -0.0734  0.0780  0.709 

df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10))

# A tibble: 10 × 6
   barista beekeeper economist lawyer ranger  trader
     <dbl>     <dbl>     <dbl>  <dbl>  <dbl>   <dbl>
 1  -0.577   -0.179      1.60   0.923  0.613 -0.446 
 2  -1.22     0.118      1.04   0.217  0.115  0.184 
 3   0.881    0.0564     0.222 -0.622 -0.543 -1.38  
 4   0.731   -1.60       0.701  1.15  -0.189 -0.0284
 5   0.435    0.129     -1.76  -0.127  0.743 -0.932 
 6   1.57    -0.812      1.24   1.05   0.147  1.34  
 7   1.66     0.624      1.34   1.01   2.27  -0.223 
 8   0.116    0.334      0.553  0.794 -1.54   0.0656
 9   0.672    1.11       0.752  0.187  0.594  0.468 
10  -1.35     0.259     -1.63  -0.738 -1.96  -0.773

I want to compute cosine similarity between the two datasets as follows: take the first word in df1, 'abel', ignore its first letter, and take the second letter, 'b'. Find the words in df2 that start with 'b' and compute the cosine similarity with each of them. Repeat the process for the remaining letters of the word.

Do the same for every word in df1.

Repeated letters within a word of df1 should be counted only once. For instance, the word 'abby' contains 'b' twice, but 'b' should be used only once. Also, no word in df2 starts with 'y', so NA should be returned for that letter.
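The letter-selection step described above could be sketched in base R as follows. This is only an illustration of the requirement, not part of any answer: the helper names `letters_after_first` and `matches_for` are hypothetical, and `df2_names` simply lists the column names of df2.

```r
# Hypothetical helper: unique letters of a word, ignoring its first letter.
letters_after_first <- function(word) {
  chars <- strsplit(word, "")[[1]]
  unique(chars[-1])
}

# Column names of df2, written out by hand for this sketch.
df2_names <- c("barista", "beekeeper", "economist", "lawyer", "ranger", "trader")

# Hypothetical helper: for each letter, collect the candidates that start
# with it; a letter with no match yields NA instead of an empty vector.
matches_for <- function(word, candidates) {
  lets <- letters_after_first(word)
  out <- lapply(lets, function(l) {
    hit <- candidates[startsWith(candidates, l)]
    if (length(hit) == 0) NA_character_ else hit
  })
  setNames(out, lets)
}

letters_after_first("abby")     # "b" "y"
matches_for("abby", df2_names)  # $b: "barista" "beekeeper"; $y: NA
```

Here 'abby' yields 'b' only once, and 'y' maps to NA because no name in df2 starts with it.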

Thanks!

dcsuka · Accepted Answer

My solution gathers all combinations of df1 names and matching df2 names, then calculates the similarity. Try this:

library(tidyverse)
library(stringdist)

df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10))

df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10))

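# Column names of df2: the candidate words to match against
df_2 <- colnames(df2)

# For each df1 name: the unique letters after the first one,
# plus the df2 names starting with each of those letters
df_1 <- as_tibble_col(colnames(df1), "col1") %>%
  mutate(col2 = map(col1, function(x) mutate(as_tibble_col(unique(unlist(str_split(x, ""))[-1]), "first_letters"),
                                             match = map(first_letters, ~df_2[which(str_detect(df_2, paste0("^", .)))])))) %>%
  unnest(everything()) %>%  # expand the nested tibbles
  unnest(everything()) %>%  # expand the matches; letters with no match are dropped
  mutate(cos_sim = stringsim(col1, match, "cosine"))  # string cosine similarity of the two names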

# # A tibble: 9 × 4
#   col1  first_letters match     cos_sim
#   <chr> <chr>         <chr>       <dbl>
# 1 abel  b             barista     0.5  
# 2 abel  b             beekeeper   0.557
# 3 abel  e             economist   0.151
# 4 abel  l             lawyer      0.612
# 5 abby  b             barista     0.544
# 6 abby  b             beekeeper   0.152
# 7 bret  r             ranger      0.530
# 8 bret  e             economist   0.302
# 9 bret  t             trader      0.707
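Note that this compares the column names as strings, not the numeric columns. As far as I can tell, `stringsim(..., "cosine")` with stringdist's default `q = 1` is the cosine between the single-character count vectors of the two strings. The base R sketch below reproduces that under this assumption; `char_cosine` is a hypothetical name, not a stringdist function.

```r
# Hypothetical helper, not part of stringdist: cosine similarity between
# the single-character count vectors of two strings.
char_cosine <- function(a, b) {
  # Propagate missing values, e.g. when a letter has no matching word
  if (is.na(a) || is.na(b)) return(NA_real_)
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  keys <- union(chars_a, chars_b)
  # Count how often each character occurs in each string
  va <- vapply(keys, function(k) sum(chars_a == k), numeric(1))
  vb <- vapply(keys, function(k) sum(chars_b == k), numeric(1))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

char_cosine("abel", "barista")  # 0.5, matching row 1 of the output above
char_cosine("abby", NA)         # NA
```

The output above has no row for the 'y' in 'abby' because `unnest()` drops empty matches by default. Passing `keep_empty = TRUE` to the second `unnest()` call should keep that row with an NA match, though I have not run the pipeline with that change.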
