Reputation: 71
I have two dataframes DF1 and DF2, created like this:
A <- c("hello", "dave", "welcome", "to", "eden")
B <- 1:5
C <- 6:10
DF1 <- data.frame(A, B, C)
D <- c("do", "you", "want", "this", "book")
E <- 11:15
F <- 16:20
DF2 <- data.frame(D, E, F)
Essentially, columns 2 and 3 hold the vector components (dimensions) of the word in column 1. I want to compute the cosine similarity of each word in DF1 to each word in DF2 and store the results in tabular form. Thanks for your help.
Upvotes: 0
Views: 759
Reputation: 71
I found another way:
library(dplyr)
library(word2vec)

# Use the first column (the words) as row names, then convert to a matrix
DF1 <- data.frame(DF1, row.names = 1) %>% as.matrix()
DF2 <- data.frame(DF2, row.names = 1) %>% as.matrix()

# Cosine similarity of every row of DF1 against every row of DF2
word2vec_similarity(DF1, DF2, type = "cosine")
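For comparison, here is a minimal base R sketch of the same idea, assuming all you need is a word-by-word matrix and no extra packages. The names m1, m2, unit_rows and cos_sim are made up for illustration: each row is scaled to unit length, and the dot products of unit vectors are the cosines.

```r
# Rebuild the example data from the question
DF1 <- data.frame(A = c("hello", "dave", "welcome", "to", "eden"), B = 1:5, C = 6:10)
DF2 <- data.frame(D = c("do", "you", "want", "this", "book"), E = 11:15, F = 16:20)

# Numeric matrices with the words as row names
m1 <- as.matrix(DF1[, -1])
rownames(m1) <- as.character(DF1$A)
m2 <- as.matrix(DF2[, -1])
rownames(m2) <- as.character(DF2$D)

# Hypothetical helper: divide every row by its Euclidean norm
unit_rows <- function(m) m / sqrt(rowSums(m^2))

# Rows are the DF1 words, columns are the DF2 words
cos_sim <- tcrossprod(unit_rows(m1), unit_rows(m2))
round(cos_sim, 4)
```

A single pair can then be looked up by name, e.g. cos_sim["hello", "do"].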
Upvotes: 0
Reputation: 681
You can try the fuzzyjoin package:
library(dplyr)
library(fuzzyjoin)
A <- c("hello", "dave", "welcome", "to", "eden")
B <- 1:5
C <- 6:10
DF1 <- data.frame(A, B, C)
D <- c("do", "you", "want", "this", "book")
E <- 11:15
F <- 16:20
DF2 <- data.frame(D, E, F)
DF1 %>%
  stringdist_full_join(DF2, by = c('A' = 'D'),
                       method = "cosine",
                       distance_col = "distance")
This gives:
         A B  C    D  E  F  distance
1    hello 1  6   do 11 16 0.7327388
2    hello 1  6  you 12 17 0.7817821
3    hello 1  6 want 13 18 1.0000000
4    hello 1  6 this 14 19 0.8110178
5    hello 1  6 book 15 20 0.6913933
6     dave 2  7   do 11 16 0.6464466
7     dave 2  7  you 12 17 1.0000000
8     dave 2  7 want 13 18 0.7500000
9     dave 2  7 this 14 19 1.0000000
10    dave 2  7 book 15 20 1.0000000
11 welcome 3  8   do 11 16 0.7642977
12 welcome 3  8  you 12 17 0.8075499
13 welcome 3  8 want 13 18 0.8333333
14 welcome 3  8 this 14 19 1.0000000
15 welcome 3  8 book 15 20 0.7278345
16      to 4  9   do 11 16 0.5000000
17      to 4  9  you 12 17 0.5917517
18      to 4  9 want 13 18 0.6464466
19      to 4  9 this 14 19 0.6464466
20      to 4  9 book 15 20 0.4226497
21    eden 5 10   do 11 16 0.7113249
22    eden 5 10  you 12 17 1.0000000
23    eden 5 10 want 13 18 0.7958759
24    eden 5 10 this 14 19 1.0000000
25    eden 5 10 book 15 20 1.0000000
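As far as I can tell, the distance column here compares the letters of the two words (character counts, i.e. q-grams with q = 1) and ignores columns B, C, E and F entirely. A rough base R sketch of that calculation, with char_cosine_dist being a made-up helper name and the q = 1 default being my assumption about what the join uses:

```r
# Hypothetical helper: cosine distance between the character-count
# profiles of two strings (1 minus the cosine similarity of the counts)
char_cosine_dist <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  chars <- union(ca, cb)
  # Count each character over the shared alphabet of both words
  va <- as.numeric(table(factor(ca, levels = chars)))
  vb <- as.numeric(table(factor(cb, levels = chars)))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

char_cosine_dist("hello", "do")  # compare with row 1 of the table
char_cosine_dist("dave", "you")  # no shared letters, so the distance is 1
```

Note that this is a distance, so 0 means identical letter profiles and 1 means nothing in common.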
But as you said below, you want the cosine similarity of the numeric vectors, not the string-based cosine distance, so you can do something like:
DF_full <- DF2 %>%
  mutate(id = 1) %>%
  inner_join(DF1 %>% mutate(id = 1), by = 'id') %>%
  mutate(vector_word_1 = purrr::map2(B, C, c),
         vector_word_2 = purrr::map2(E, F, c)) %>%
  mutate(cosine_sim = purrr::map2(vector_word_1, vector_word_2, lsa::cosine))
DF_full
    D  E  F id       A B  C vector_word_1 vector_word_2 cosine_sim
1  do 11 16  1   hello 1  6          1, 6        11, 16  0.9059667
2  do 11 16  1    dave 2  7          2, 7        11, 16  0.9479735
3  do 11 16  1 welcome 3  8          3, 8        11, 16   0.970496
4  do 11 16  1      to 4  9          4, 9        11, 16  0.9831082
5  do 11 16  1    eden 5 10         5, 10        11, 16  0.9904049
6 you 12 17  1   hello 1  6          1, 6        12, 17  0.9006583
7 you 12 17  1    dave 2  7          2, 7        12, 17  0.9439612
8 you 12 17  1 welcome 3  8          3, 8        12, 17  0.9674378
9 you 12 17  1      to 4  9          4, 9        12, 17  0.9807679
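Since the question asks for a table, here is a hedged base R sketch of the same pairing that ends in a word-by-word grid. It assumes the two-dimensional case from the question; pairs and wide are illustrative names, merge(..., by = NULL) builds the same all-pairs combination as the id = 1 join, and xtabs() spreads the long result into a table.

```r
DF1 <- data.frame(A = c("hello", "dave", "welcome", "to", "eden"), B = 1:5, C = 6:10)
DF2 <- data.frame(D = c("do", "you", "want", "this", "book"), E = 11:15, F = 16:20)

# by = NULL gives the Cartesian product: every DF1 row with every DF2 row
pairs <- merge(DF1, DF2, by = NULL)

# Cosine similarity of (B, C) and (E, F), computed column-wise for all rows
pairs$cosine_sim <- with(pairs,
  (B * E + C * F) / (sqrt(B^2 + C^2) * sqrt(E^2 + F^2)))

# Long to wide: one row per DF1 word, one column per DF2 word
# (each word pair occurs once, so the "sum" in each cell is just that value)
wide <- xtabs(cosine_sim ~ A + D, data = pairs)
round(wide, 4)
```

The rows and columns of wide come out in alphabetical order rather than in the original word order.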
In your second comment, you ask what to do when the vectors have length 100, for instance. Maybe something like this:
df_1 <- tibble(A = rep(A, each = 100),
               B = rnorm(500)) %>%
  group_by(A) %>%
  summarise(B = list(B))
df_2 <- tibble(D = rep(D, each = 100),
               E = rnorm(500)) %>%
  group_by(D) %>%
  summarise(E = list(E))
DF_full <- df_2 %>%
  mutate(id = 1) %>%
  inner_join(df_1 %>% mutate(id = 1), by = 'id') %>%
  mutate(cosine_sim = purrr::map2_dbl(B, E, lsa::cosine))
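If you would rather not depend on lsa for this, a small hand-written cosine function should also work for vectors of any length. This is only a sketch: the random vectors stand in for real embeddings, the seed is arbitrary, and vecs1, vecs2, cosine_sim and sim are illustrative names.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
words1 <- c("hello", "dave", "welcome", "to", "eden")
words2 <- c("do", "you", "want", "this", "book")

# One random 100-dimensional vector per word, stored in named lists
vecs1 <- setNames(lapply(words1, function(w) rnorm(100)), words1)
vecs2 <- setNames(lapply(words2, function(w) rnorm(100)), words2)

# Hypothetical helper: cosine similarity of two equal-length numeric vectors
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Rows are the words in vecs1, columns are the words in vecs2
sim <- sapply(vecs2, function(y) sapply(vecs1, function(x) cosine_sim(x, y)))
round(sim, 3)
```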
Upvotes: 1