Daniel B.G

Reputation: 121

Text correlation with R

I'm working with a data frame (df) in which each row holds a text ID, the text itself, and the number of words in that text. It looks something like this:

    ID                        Text     W_Count
Text_1         I love green apples           4
Text_2    I love yellow submarines           4
Text_3 Remember to buy some apples           5
Text_4               No match here           3

With that data frame I want to calculate the number of words that each pair of rows has in common. For example, Text_1 and Text_2 have two words in common, while Text_1 and Text_3 have just one.

Once I have that, I need to display the data in a matrix similar to this one:

      ID Text_1 Text_2 Text_3 Text_4
Text_1      4      2      1      0
Text_2      2      4      0      0
Text_3      1      0      5      0
Text_4      0      0      0      3

I managed to do this with only two rows, for example Text_1 and Text_2:

Text_1 = df[1, 2]
Text_2 = df[2, 2]
Text_1_split <- unlist(strsplit(Text_1, split =" "))
Text_2_split <- unlist(strsplit(Text_2, split =" "))
count = length(intersect(Text_1_split, Text_2_split))
count
[1] 2

But I don't know how to apply this systematically to all rows and then display the matrix I need.

Any help would be very much appreciated.

Upvotes: 3

Views: 202

Answers (1)

Daniel V

Reputation: 1384

You're probably looking for the vapply function. Consider the following:

vapply(df$ID,
       function(x) {
         sapply(df$ID,
                function(y) {
                  # Split the two texts being compared into words
                  x_split <- unlist(strsplit(df$Text[df$ID == x], split = " "))
                  y_split <- unlist(strsplit(df$Text[df$ID == y], split = " "))

                  # Count the distinct words the two texts share
                  return(length(intersect(x_split, y_split)))
                })
       },
       integer(nrow(df)))

The vapply function works like sapply, but it checks every result against the template passed as its third argument (FUN.VALUE). In this case the template is an integer vector whose length equals the number of rows in your data, so the call returns an integer matrix with one row and one column per text.
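For a version you can paste and run, here is a minimal sketch that rebuilds the example data from the question and produces the labelled matrix. The names `words` and `common` are just illustrative, and splitting each text once up front is a small variation on the code above, not a different technique.

```r
# Rebuild the example data from the question
# (stringsAsFactors = FALSE keeps Text as character so strsplit() accepts it)
df <- data.frame(
  ID = c("Text_1", "Text_2", "Text_3", "Text_4"),
  Text = c("I love green apples",
           "I love yellow submarines",
           "Remember to buy some apples",
           "No match here"),
  stringsAsFactors = FALSE
)

# Split every text into words once, and name the pieces by ID
words <- setNames(strsplit(df$Text, split = " ", fixed = TRUE), df$ID)

# For each text x, count the distinct words it shares with every text y.
# FUN.VALUE tells vapply to expect one integer per row of df.
common <- vapply(
  words,
  function(x) vapply(words, function(y) length(intersect(x, y)), integer(1)),
  FUN.VALUE = integer(nrow(df))
)

# Label rows and columns explicitly with the text IDs
dimnames(common) <- list(df$ID, df$ID)

common
```

Note that intersect() works on distinct words and is case-sensitive, so the diagonal only equals W_Count when a text does not repeat any word, and "Apples" would not match "apples".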

Upvotes: 3
