Reputation: 121
I'm working with a data frame in which each row holds a text ID, the text itself, and the number of words in that text. It looks something like this:
ID      Text                         W_Count
Text_1  I love green apples          4
Text_2  I love yellow submarines     4
Text_3  Remember to buy some apples  5
Text_4  No match here                3
From that data frame I want to count the words that each pair of rows has in common. For example, Text_1 and Text_2 have two words in common, while Text_1 and Text_3 have just one.
Once I have that, I need to display the data in a matrix similar to this one:
ID      Text_1  Text_2  Text_3  Text_4
Text_1  4       2       1       0
Text_2  2       4       0       0
Text_3  1       0       5       0
Text_4  0       0       0       3
I managed to do this for just two rows, for example Text_1 and Text_2:
Text_1 = df[1, 2]
Text_2 = df[2, 2]
Text_1_split <- unlist(strsplit(Text_1, split =" "))
Text_2_split <- unlist(strsplit(Text_2, split =" "))
count = length(intersect(Text_1_split, Text_2_split))
count
[1] 2
But I don't know how to apply this systematically to all rows and then display the matrix I need.
Any help would be very much appreciated.
Upvotes: 3
Views: 202
Reputation: 1384
You're probably looking for the vapply function. Consider the following:
vapply(df$ID,
       function(x) {
         # For one ID x, count the words shared with every ID y
         sapply(df$ID,
                function(y) {
                  x_split <- unlist(strsplit(df$Text[df$ID == x], split = " "))
                  y_split <- unlist(strsplit(df$Text[df$ID == y], split = " "))
                  return(length(intersect(x_split, y_split)))
                })
       },
       integer(nrow(df)))
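The vapply function ("vector-apply") applies a function to each element of its input and checks that every result matches the template given as its third argument (in this case, an integer vector with one entry per row of df). Because each call returns a vector of that length, vapply binds the results together as the columns of a matrix, which is the pairwise table you are after.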
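As a rough end-to-end check, here is a minimal sketch that rebuilds the sample data from the question and produces the matrix. The names `words` and `common` are made up for illustration, and splitting each text once up front is a small variation on the code above rather than a different method. It assumes words are separated by single spaces and compared case-sensitively.

```r
# Sample data reconstructed from the question. stringsAsFactors = FALSE keeps
# Text as character on R versions older than 4.0.
df <- data.frame(
  ID   = c("Text_1", "Text_2", "Text_3", "Text_4"),
  Text = c("I love green apples",
           "I love yellow submarines",
           "Remember to buy some apples",
           "No match here"),
  stringsAsFactors = FALSE
)

# Split every text once and name the resulting list by ID
# ("words" is a hypothetical helper name).
words <- setNames(strsplit(df$Text, split = " ", fixed = TRUE), df$ID)

# Inner vapply: one shared-word count per text.
# Outer vapply: checks each result against integer(length(words)) and
# binds the results as the columns of a matrix.
common <- vapply(
  words,
  function(x) vapply(words, function(y) length(intersect(x, y)), integer(1)),
  integer(length(words))
)

print(common)
```

Note that intersect() drops duplicates, so the diagonal holds the number of distinct words in each text. That equals W_Count for this sample, but it would be smaller for a text that repeats a word.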
Upvotes: 3