Reputation: 1618
I have a vector of sentences, say:
x = c("I like donut", "I like pizza", "I like donut and pizza")
I want to count how often each pair of words occurs together in the same sentence. The ideal output is a data frame with 3 columns (word1, word2 and frequency), and would be something like this:
I like 3
I donut 2
I pizza 2
like donut 2
like pizza 2
donut pizza 1
donut and 1
pizza and 1
In the first row of the output, the frequency is 3 because "I" and "like" occur together 3 times: in x[1], x[2] and x[3].
Any advice is appreciated :)
Upvotes: 2
Views: 1948
Reputation: 42629
Split each sentence into words, sort the words so that the same pair always comes out in the same order, get all pairs with combn, paste each pair into a space-separated string, use table to get the frequencies, then put it all together.
Here's an example:
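f <- function(x) {
  # One "word1 word2" string per pair of words in each sentence
  pr <- unlist(
    lapply(
      strsplit(x, ' '),
      # sort() so that a given pair is always written in the same order
      function(i) combn(sort(i), 2, paste, collapse=' ')
    )
  )
  # Count how many times each pair occurs
  tbl <- table(pr)
  # Split the pair strings back into two columns
  d <- do.call(rbind.data.frame, strsplit(names(tbl), ' '))
  names(d) <- c('word1', 'word2')
  d$Freq <- tbl
  d
}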
With your example data:
> f(x)
   word1 word2 Freq
1    and donut    1
2    and     I    1
3    and  like    1
4    and pizza    1
5  donut     I    2
6  donut  like    2
7  donut pizza    1
8      I  like    3
9      I pizza    2
10  like pizza    2
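If the real data may contain one-word sentences or repeated words, the function above could run into trouble: combn stops with an error when given fewer than two words, and a word repeated within a sentence gets paired with itself. Below is a rough sketch of one way to guard against both, using base R only. The name count_pairs is made up for this example, and counting each word only once per sentence is an assumption about what is wanted, not something stated in the question.

```r
# Hypothetical variant (not the function above): same idea, with guards for
# one-word sentences, repeated words, and extra spaces.
count_pairs <- function(x) {
  pairs <- lapply(strsplit(x, " ", fixed = TRUE), function(w) {
    # Assumption: each word counts once per sentence; drop empty strings
    w <- sort(unique(w[nzchar(w)]))
    # combn() needs at least two words, so skip shorter sentences
    if (length(w) < 2) return(NULL)
    t(combn(w, 2))  # one row per pair, two columns
  })
  m <- do.call(rbind, pairs)  # NULL entries are ignored by rbind
  if (is.null(m)) {
    return(data.frame(word1 = character(), word2 = character(),
                      Freq = integer()))
  }
  d <- data.frame(word1 = m[, 1], word2 = m[, 2], Freq = 1L,
                  stringsAsFactors = FALSE)
  # Sum the per-sentence occurrences of each pair
  d <- aggregate(Freq ~ word1 + word2, data = d, FUN = sum)
  # Most frequent pairs first
  d <- d[order(-d$Freq, d$word1, d$word2), ]
  rownames(d) <- NULL
  d
}

x <- c("I like donut", "I like pizza", "I like donut and pizza")
count_pairs(x)
```

With the example data, the pair made of "I" and "like" comes first with a frequency of 3. Which of the two words ends up in word1 depends on how sort orders them in your locale.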
Upvotes: 6