Count Word Occurrence in Distance to Defined Term

Question

I have a vector of text strings, such as:

Sentences <- c("Lorem ipsum dolor sit amet, WORD consetetur LOOK sadipscing elitr, sed diam nonumy.",
               "Eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
               "At vero eos LOOK et accusam et justo duo WORD dolores et ea rebum." ,
               "Stet clita kasd gubergren, no sea takimata sanctus est Lorem WORD ipsum dolor sit amet.",
               "Lorem ipsum dolor sit amet, consetetur sadipscing LOOK elitr, sed diam nonumy eirmod tempor.",
               "Invidunt ut labore et WORD dolore magna aliquyam erat, sed LOOK diam voluptua." ,
               "Duis autem vel eum iriure dolor in hendrerit in LOOK vulputate velit esse LOOK molestie consequat.",
               "El illum dolore eu feugiat nulla LOOK WORD",
               "Facilisis at LOOK vero eros et accumsan et WORD iusto LOOK odio dignissim quit.",
               "Blandit LOOK praesent WORD LOOK luptatum zzril delenit augue duis dolore te feugait nulla facilisi.")

I would like to count how often a particular word (example: 'LOOK') occurs within a maximum distance of n words (example: three) of a defined term (example: 'WORD').

The result should look like this (maximum distance: three):

Result <- c(1,0,0,0,0,0,0,1,1,2)

Thank you in advance.

Florian · Accepted Answer

Here is a possible solution. We write a function that takes a sentence, the two words to compare, and a maximum distance (default: three). It splits the sentence on spaces to obtain a vector of words and finds the positions of both words in that vector. With expand.grid, it then builds a data.frame containing all combinations of those positions and counts how many pairs are at most max_dist words apart. That count is returned.
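To make these steps concrete, here is a small trace on the last example sentence. It is only an illustration; the variable names (tokens, look_pos, word_pos, pairs, n_close) are made up for this sketch and are not part of the function.

```r
# Trace of the individual steps on one sentence (illustrative names only).
sentence <- "Blandit LOOK praesent WORD LOOK luptatum zzril delenit augue duis dolore te feugait nulla facilisi."

# 1. Split on single spaces to get a vector of words.
tokens <- strsplit(sentence, " ")[[1]]

# 2. Positions of each term within that vector.
look_pos <- which(tokens == "LOOK")   # 2 5
word_pos <- which(tokens == "WORD")   # 4

# 3. All combinations of positions; expand.grid names the columns Var1, Var2.
pairs <- expand.grid(look_pos, word_pos)

# 4. Distance in words for every pair, then count the pairs within 3 words.
pairs$dist <- abs(pairs$Var1 - pairs$Var2)   # 2 1
n_close <- sum(pairs$dist <= 3)              # 2
```

Both LOOK occurrences are within three words of WORD, which matches the last entry of the expected result.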

word1='LOOK'
word2='WORD'

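count_word_dist <- function(x,word1,word2,max_dist=3)
{
  # split the sentence on spaces and locate both terms
  x = strsplit(x," ")[[1]]
  w1 = which(x==word1)
  w2 = which(x==word2)
  # if both terms occur, count all position pairs at most max_dist apart
  if(length(w1) >0 && length(w2)>0)
    return(sum(with(expand.grid(w1,w2),abs(Var1-Var2))<=max_dist))
  else
    return(0)
}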

result = unname(sapply(Sentences,function(y) {count_word_dist(y,word1,word2)}))

Output:

> result
 [1] 1 0 0 0 0 0 0 1 1 2
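One caveat: because the sentence is split on single spaces, a token such as "WORD." or "LOOK," would not match. The sample data does not appear to trigger this, but for other text a variant along these lines might help. This is a sketch under the assumption that punctuation can simply be dropped; count_word_dist_clean is a hypothetical name, not part of the answer above.

```r
# Variant that strips punctuation before splitting (assumption: punctuation
# carries no meaning for the match and can be removed).
count_word_dist_clean <- function(x, word1, word2, max_dist = 3) {
  # remove punctuation, then split on runs of whitespace
  tokens <- strsplit(gsub("[[:punct:]]", "", x), "\\s+")[[1]]
  w1 <- which(tokens == word1)
  w2 <- which(tokens == word2)
  # nothing to count unless both terms occur
  if (length(w1) == 0 || length(w2) == 0) return(0)
  # same pairwise comparison as in count_word_dist
  sum(with(expand.grid(w1, w2), abs(Var1 - Var2)) <= max_dist)
}

example_count <- count_word_dist_clean("Lorem LOOK, ipsum WORD.", "LOOK", "WORD")
```

Here "LOOK," and "WORD." are recognised after cleaning, so the pair (two words apart) is counted.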

Hope this helps!
