queserasera

Reputation: 71

Cosine similarity between rows of two dataframes in R

I have two dataframes DF1 and DF2, created like this:

A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)

D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)

Columns 2 and 3 hold the vector components (dimensions) of the word in column 1. I want to compute the cosine similarity of each word in DF1 to each word in DF2 and store the result in tabular form. Thanks for your help.

Upvotes: 0

Views: 759

Answers (2)

queserasera

Reputation: 71

I found another way:

DF1 <- data.frame(DF1, row.names = 1)  # use the first column (the words) as row names
DF2 <- data.frame(DF2, row.names = 1)  # use the first column (the words) as row names

DF1 <- as.matrix(DF1)  # convert from data frame to matrix
DF2 <- as.matrix(DF2)  # convert from data frame to matrix

library(word2vec)
word2vec_similarity(DF1, DF2, type = "cosine")
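For comparison, the same row-by-row cosine similarity can be sketched in base R with no extra packages. This is only a minimal sketch and not a description of what word2vec_similarity does internally; the helper name cosine_rows and the names m1, m2 and sim are made up for illustration. It scales every row to unit length, after which the matrix product of the two sets of rows gives the cosines.

```r
# Rebuild the example data from the question.
DF1 <- data.frame(A = c("hello", "dave", "welcome", "to", "eden"), B = 1:5, C = 6:10)
DF2 <- data.frame(D = c("do", "you", "want", "this", "book"), E = 11:15, F = 16:20)

# Numeric matrices with the words as row names.
m1 <- as.matrix(DF1[, -1]); rownames(m1) <- as.character(DF1$A)
m2 <- as.matrix(DF2[, -1]); rownames(m2) <- as.character(DF2$D)

# Hypothetical helper: cosine similarity of every row of x to every row of y.
cosine_rows <- function(x, y) {
  x_unit <- x / sqrt(rowSums(x^2))  # scale each row to length 1
  y_unit <- y / sqrt(rowSums(y^2))
  tcrossprod(x_unit, y_unit)        # dot products of unit rows are the cosines
}

# 5 x 5 matrix: rows are the DF1 words, columns are the DF2 words.
sim <- cosine_rows(m1, m2)
round(sim, 4)
```

A single pair can then be looked up by name, for example sim["hello", "do"], which should agree with the 0.9059667 reported for that pair in the other answer.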

Upvotes: 0

Guillaume

Reputation: 681

You can try the fuzzyjoin package.

library(dplyr)
library(fuzzyjoin)


A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)

D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)

DF1 %>% 
  stringdist_full_join(DF2, by = c('A' = 'D'), 
                       method = "cosine", 
                       distance_col = "distance")

This gives:

         A B  C    D  E  F  distance
1    hello 1  6   do 11 16 0.7327388
2    hello 1  6  you 12 17 0.7817821
3    hello 1  6 want 13 18 1.0000000
4    hello 1  6 this 14 19 0.8110178
5    hello 1  6 book 15 20 0.6913933
6     dave 2  7   do 11 16 0.6464466
7     dave 2  7  you 12 17 1.0000000
8     dave 2  7 want 13 18 0.7500000
9     dave 2  7 this 14 19 1.0000000
10    dave 2  7 book 15 20 1.0000000
11 welcome 3  8   do 11 16 0.7642977
12 welcome 3  8  you 12 17 0.8075499
13 welcome 3  8 want 13 18 0.8333333
14 welcome 3  8 this 14 19 1.0000000
15 welcome 3  8 book 15 20 0.7278345
16      to 4  9   do 11 16 0.5000000
17      to 4  9  you 12 17 0.5917517
18      to 4  9 want 13 18 0.6464466
19      to 4  9 this 14 19 0.6464466
20      to 4  9 book 15 20 0.4226497
21    eden 5 10   do 11 16 0.7113249
22    eden 5 10  you 12 17 1.0000000
23    eden 5 10 want 13 18 0.7958759
24    eden 5 10 this 14 19 1.0000000
25    eden 5 10 book 15 20 1.0000000

As you noted in the comment below, though, you want the cosine similarity of the numeric vectors, not the string cosine distance between the words themselves, so you can do something like:

DF_full <- DF2 %>% 
  mutate(id = 1) %>% 
  inner_join(DF1 %>% mutate(id = 1), by = 'id') %>% 
  mutate(vector_word_1 = purrr::map2(B,C, c),
         vector_word_2 = purrr::map2(E,F, c)) %>% 
  mutate(cosine_sim = purrr::map2(vector_word_1, vector_word_2, lsa::cosine))
 DF_full
      D  E  F id       A B  C vector_word_1 vector_word_2 cosine_sim
1    do 11 16  1   hello 1  6          1, 6        11, 16  0.9059667
2    do 11 16  1    dave 2  7          2, 7        11, 16  0.9479735
3    do 11 16  1 welcome 3  8          3, 8        11, 16   0.970496
4    do 11 16  1      to 4  9          4, 9        11, 16  0.9831082
5    do 11 16  1    eden 5 10         5, 10        11, 16  0.9904049
6   you 12 17  1   hello 1  6          1, 6        12, 17  0.9006583
7   you 12 17  1    dave 2  7          2, 7        12, 17  0.9439612
8   you 12 17  1 welcome 3  8          3, 8        12, 17  0.9674378
9   you 12 17  1      to 4  9          4, 9        12, 17  0.9807679

In your second comment you ask what to do when each vector has, for instance, 100 components. Maybe something like:

df_1 <- tibble(A = rep(A, each = 100), 
               B = rnorm(500)) %>% 
  group_by(A) %>% 
  summarise(B = list(B))


df_2 <- tibble(D = rep(D, each = 100), 
               E = rnorm(500)) %>% 
  group_by(D) %>% 
  summarise(E = list(E))

DF_full <- df_2 %>% 
  mutate(id = 1) %>% 
  inner_join(df_1 %>% mutate(id = 1), by = 'id') %>% 
  mutate(cosine_sim = purrr::map2_dbl(B, E, lsa::cosine))
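Since the question asks for a tabular result, the long table of pairs could then be reshaped into one row per word of A and one column per word of D. The block below is only a sketch under some assumptions: it rebuilds a long table in base R so that it runs on its own (in the pipeline above the equivalent would be DF_full[, c("A", "D", "cosine_sim")]), and the names vec_1, vec_2, cosine_vec, long and wide are made up for illustration.

```r
A <- c("hello", "dave", "welcome", "to", "eden")
D <- c("do", "you", "want", "this", "book")

set.seed(1)  # arbitrary seed, only so the random vectors are repeatable
# One random vector of length 100 per word, stored in named lists.
vec_1 <- setNames(replicate(length(A), rnorm(100), simplify = FALSE), A)
vec_2 <- setNames(replicate(length(D), rnorm(100), simplify = FALSE), D)

# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_vec <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Long format: one row per (A, D) pair.
long <- expand.grid(A = A, D = D, stringsAsFactors = FALSE)
long$cosine_sim <- mapply(
  function(a, d) cosine_vec(vec_1[[a]], vec_2[[d]]),
  long$A, long$D, USE.NAMES = FALSE
)

# Wide format: each cell holds the similarity of that word pair.
# xtabs sums per cell; every pair occurs once, so the sum is the value itself.
# Note that xtabs sorts the words alphabetically.
wide <- xtabs(cosine_sim ~ A + D, data = long)
round(wide, 3)
```

Individual cells can be read by name, for example wide["hello", "do"].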

Upvotes: 1
