Gabriel Voelcker

Reputation: 1

Unnest non-consecutive tokens in R

Suppose I have a few sentences describing how John spends his days, stored in a data frame in R:

library(dplyr)  # re-exports tibble(); data_frame() is deprecated
df <- tibble(sentence = c("John went to work this morning", "John likes to jog", "John is hungry"))

I want to identify which words occur most often in sentences that contain "John". I can use unnest_tokens() to extract consecutive words. How can I count recurring pairings that are non-consecutive?

The goal is to obtain a result that counts how many times each other word appears close to John:

df2 <- tibble(word1 = rep("John", 9),
              word2 = c("went", "to", "work", "this", "morning", "likes", "jog", "is", "hungry"),
              n = c(1, 2, 1, 1, 1, 1, 1, 1, 1))

Upvotes: 0

Views: 37

Answers (1)

Mohamed Desouky

Reputation: 4425

  • We can try
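library(dplyr)

# Split each sentence on spaces and pair its first word with all remaining words.
# Note: this assumes the target word ("John") is always the first word of the sentence.
lst <- lapply(strsplit(df$sentence, " "), \(x) list(x[1], x[-1])) |>
       lapply(\(x) data.frame(x[1], x[2]))

# Name the columns, stack the per-sentence data frames
# (the `_` placeholder needs R >= 4.2), then count each (word1, word2) pair.
ans <- lapply(lst, \(x) {colnames(x) <- c("word1", "word2"); x}) |>
       do.call(rbind, args = _) |> group_by(word1, word2) |>
       summarise(n = n())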

  • Output
# A tibble: 9 × 3
# Groups:   word1 [1]
  word1 word2       n
  <chr> <chr>   <int>
1 John  hungry      1
2 John  is          1
3 John  jog         1
4 John  likes       1
5 John  morning     1
6 John  this        1
7 John  to          2
8 John  went        1
9 John  work        1
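
  • A possible variant with unnest_tokens()
The code above relies on "John" being the first word of every sentence. If that does not hold, a rough sketch along the following lines might work instead: tokenize with tidytext, keep only the sentences that mention the target word anywhere, and count every other word in them. The names target, sentence_id and pair_counts are only illustrative, and the sketch assumes the tidytext package is installed. Since unnest_tokens() lowercases tokens by default, the target is written in lower case here.

```r
library(dplyr)
library(tidytext)

df <- tibble(sentence = c("John went to work this morning",
                          "John likes to jog",
                          "John is hungry"))

# Hypothetical name: the word whose sentence-level companions we want to count.
target <- "john"

pair_counts <- df |>
  mutate(sentence_id = row_number()) |>   # remember which sentence each word came from
  unnest_tokens(word2, sentence) |>       # one row per (lowercased) word
  group_by(sentence_id) |>
  filter(target %in% word2) |>            # keep sentences that mention the target anywhere
  ungroup() |>
  filter(word2 != target) |>              # drop the target word itself
  mutate(word1 = target) |>
  count(word1, word2, sort = TRUE)        # pair frequencies, most frequent first

print(pair_counts)
```

Passing to_lower = FALSE to unnest_tokens() should keep the original capitalization, in which case target would be "John". A word that occurs twice in the same sentence is counted twice by this sketch.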

Upvotes: 0
