Reputation: 1
Suppose I have a few sentences describing how John spends his days stored in a dataframe in R:
df <- tibble::tibble(sentence = c("John went to work this morning", "John likes to jog", "John is hungry"))
I want to identify which words occur most often in sentences that contain "John". I can use unnest_tokens() to identify consecutive words. How can I identify recurring pairings that are not consecutive?
The goal is to obtain a result that counts how many times each other word appears close to John:
df2 <- tibble::tibble(word1 = rep("John", 9),
word2 = c("went", "to", "work", "this", "morning", "likes", "jog", "is", "hungry"),
n = c(1, 2, 1, 1, 1, 1, 1, 1, 1))
Upvotes: 0
Views: 37
Reputation: 4425
library(dplyr)
# Split each sentence into words and pair the first word with every later word
lst <- lapply(strsplit(df$sentence, " "),
              \(x) data.frame(word1 = x[1], word2 = x[-1]))
# Stack the per-sentence pairs and count each (word1, word2) combination
ans <- do.call(rbind, lst) |>
  group_by(word1, word2) |>
  summarise(n = n())
# A tibble: 9 × 3
# Groups:   word1 [1]
  word1 word2       n
  <chr> <chr>   <int>
1 John  hungry      1
2 John  is          1
3 John  jog         1
4 John  likes       1
5 John  morning     1
6 John  this        1
7 John  to          2
8 John  went        1
9 John  work        1
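The code above pairs the first word of each sentence with the rest, so it relies on "John" always coming first. If the target word can appear anywhere in a sentence, a base R variant along the following lines might work. This is only a sketch: `count_cooccurrences` and its arguments are hypothetical names introduced here, and it assumes words are separated by single spaces and matched exactly (case-sensitive, no punctuation handling).

```r
# Hypothetical helper (not part of the original answer): count the words that
# share a sentence with `target`, wherever `target` sits in that sentence.
# Assumes single-space separation and exact, case-sensitive matching.
count_cooccurrences <- function(sentences, target) {
  words <- strsplit(sentences, " ", fixed = TRUE)
  # Keep only the sentences that contain the target word
  words <- Filter(function(w) target %in% w, words)
  # Drop the target itself and pool the remaining words
  others <- as.character(unlist(lapply(words, function(w) w[w != target])))
  # Count each distinct word, keeping the order of first appearance
  u <- unique(others)
  data.frame(word1 = rep(target, length(u)),
             word2 = u,
             n = tabulate(match(others, u), nbins = length(u)),
             stringsAsFactors = FALSE)
}

sentences <- c("John went to work this morning",
               "John likes to jog",
               "John is hungry")
res <- count_cooccurrences(sentences, "John")
print(res)
```

Because the counts are built with `unique()` and `tabulate()`, the rows come out in order of first appearance rather than alphabetically; sort with `res[order(-res$n), ]` if the most frequent pairings should come first.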
Upvotes: 0