Reputation: 71
Let's say I have two datasets like this:
library(tidyverse)

df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10)
)
# A tibble: 10 × 3
abel abby bret
<dbl> <dbl> <dbl>
1 -1.24 -0.498 0.663
2 -0.179 -1.49 0.619
3 1.36 -2.19 1.43
4 0.145 0.520 1.01
5 -0.873 1.21 0.0441
6 -0.0434 0.103 0.201
7 -0.472 -0.389 -1.86
8 -0.486 -0.318 0.291
9 0.0884 -0.0458 -0.527
10 -0.0734 0.0780 0.709
df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10)
)
# A tibble: 10 × 6
barista beekeeper economist lawyer ranger trader
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.577 -0.179 1.60 0.923 0.613 -0.446
2 -1.22 0.118 1.04 0.217 0.115 0.184
3 0.881 0.0564 0.222 -0.622 -0.543 -1.38
4 0.731 -1.60 0.701 1.15 -0.189 -0.0284
5 0.435 0.129 -1.76 -0.127 0.743 -0.932
6 1.57 -0.812 1.24 1.05 0.147 1.34
7 1.66 0.624 1.34 1.01 2.27 -0.223
8 0.116 0.334 0.553 0.794 -1.54 0.0656
9 0.672 1.11 0.752 0.187 0.594 0.468
10 -1.35 0.259 -1.63 -0.738 -1.96 -0.773
I want to compute the cosine similarity between the two datasets in this manner: take the first word in df1, 'abel', ignore its first letter, take the second letter 'b', find the words in df2 that start with 'b', and compute the cosine similarity. Repeat the process with the remaining letters of the word.
Do the same for all the words in df1.
This has to account for repeated letters within a word in df1. For instance, the word 'abby' has 'b' twice, but it should be counted only once. Also, there is no word in df2 starting with 'y', so NA should be returned.
Thanks!
Upvotes: 1
Views: 261
Reputation: 2997
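My solution first gathers all combinations of df1 name, letter, and matching df2 name, then calculates the similarity. Note that stringsim() compares the column names as strings, not the numeric values in the columns, and letters with no match in df2 (such as 'y') are dropped rather than returned as NA. Try this: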
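library(tidyverse)
library(stringdist)

df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10)
)
df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10)
)

# Column names of df2 to match against
df_2 <- colnames(df2)

df_1 <- as_tibble_col(colnames(df1), "col1") %>%
  # For each df1 name: the unique letters after the first one,
  # and the df2 names that start with each of those letters
  mutate(col2 = map(col1, function(x) mutate(
    as_tibble_col(unique(unlist(str_split(x, ""))[-1]), "first_letters"),
    match = map(first_letters, ~ df_2[which(str_detect(df_2, paste0("^", .)))])
  ))) %>%
  unnest(everything()) %>% # one row per (name, letter)
  unnest(everything()) %>% # one row per (name, letter, match); unmatched letters are dropped
  # Cosine similarity between the two name strings
  mutate(cos_sim = stringsim(col1, match, "cosine"))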
# # A tibble: 9 × 4
# col1 first_letters match cos_sim
# <chr> <chr> <chr> <dbl>
# 1 abel b barista 0.5
# 2 abel b beekeeper 0.557
# 3 abel e economist 0.151
# 4 abel l lawyer 0.612
# 5 abby b barista 0.544
# 6 abby b beekeeper 0.152
# 7 bret r ranger 0.530
# 8 bret e economist 0.302
# 9 bret t trader 0.707
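If what you need is the cosine similarity between the numeric columns themselves (and an NA row for letters like 'y' that have no match), here is a minimal sketch of one way to do it. The helper cosine_sim(), the result name res, and the set.seed() call are my own additions for illustration and are not part of the code above; keep_empty = TRUE is what keeps the unmatched letters as NA rows.

```r
library(tidyverse)

set.seed(1) # assumption: fixed seed only so the sketch is reproducible
df1 <- tibble(abel = rnorm(10), abby = rnorm(10), bret = rnorm(10))
df2 <- tibble(
  barista = rnorm(10), beekeeper = rnorm(10), economist = rnorm(10),
  lawyer = rnorm(10), ranger = rnorm(10), trader = rnorm(10)
)

# Hypothetical helper: cosine similarity of two numeric vectors
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

res <- tibble(col1 = colnames(df1)) %>%
  # Unique letters of each df1 name, ignoring the first letter
  mutate(letter = map(col1, ~ unique(str_split(.x, "")[[1]][-1]))) %>%
  unnest(letter) %>%
  # df2 names starting with that letter (possibly none)
  mutate(match = map(letter, ~ colnames(df2)[startsWith(colnames(df2), .x)])) %>%
  # keep_empty = TRUE keeps letters without a match as a row with match = NA
  unnest(match, keep_empty = TRUE) %>%
  # Cosine similarity between the numeric columns, NA where there is no match
  mutate(cos_sim = map2_dbl(
    col1, match,
    ~ if (is.na(.y)) NA_real_ else cosine_sim(df1[[.x]], df2[[.y]])
  ))

res
```

With the column names from the question this gives 10 rows: the 9 combinations shown above plus one row for abby / y where match and cos_sim are NA. The actual values depend on the random data.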
Upvotes: 1