Melvin Wevers

Reputation: 151

How to count documents in which two words appear in close proximity in R?

I would like to count the documents in which two strings appear within a set distance of each other, say within 10 words. Take 'German*' and 'War' as an example. I do not want the total number of co-occurrences, only the number of documents in which the pair appears (a document counts once, however many times the pair occurs in it).

I know how to count documents that contain a single word. But I am not sure whether I need to extract 10-grams, check whether both words appear in them, and count this per document, or whether there is a more efficient way.

Upvotes: 1

Views: 205

Answers (1)

PetitJean

Reputation: 1778

Here is a small function that tests whether two words occur within 100 characters of each other in a text.

isclose <- function(text, limit = 100) { # limit: interval in character counts
  # Start positions of every match; gregexpr() returns -1 when there is none
  match1 <- gregexpr('war', text)[[1]]
  match2 <- gregexpr('German', text)[[1]]
  if (match1[1] == -1 || match2[1] == -1) return(FALSE)

  # TRUE if any pair of start positions is closer than `limit` characters
  any(abs(outer(match1, match2, "-")) < limit)
}

It works, but it measures the distance in characters; it should be adapted to count words instead.
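One way to measure the distance in words rather than characters is to split each text into tokens and compare token positions. The following is only a sketch in base R: the helper name `words_close`, its parameters, the tokenisation rule (lowercase, split on non-alphanumeric characters) and the sample `docs` vector are all assumptions made for illustration, not part of the original question.

```r
# Hypothetical helper: TRUE if the two patterns match tokens that are
# at most `window` word positions apart within a single document.
words_close <- function(text, pattern1 = "^german", pattern2 = "^war$",
                        window = 10) {
  # Assumed tokenisation: lowercase, split on runs of non-alphanumerics
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]

  # Token positions of each pattern ("^german" also covers "germans", "germany")
  pos1 <- grep(pattern1, tokens)
  pos2 <- grep(pattern2, tokens)
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)

  # Compare every pair of positions
  any(abs(outer(pos1, pos2, "-")) <= window)
}

# Made-up sample documents
docs <- c(
  "The German army entered the war in 1914.",
  "German cooking has nothing to do with this.",
  "War broke out. Many chapters later, after a long digression about trade, farming, music, art and weather, the Germans returned."
)

# One logical per document, so each document counts at most once
doc_hits <- vapply(docs, words_close, logical(1), USE.NAMES = FALSE)
n_docs <- sum(doc_hits)
```

Because the function returns a single TRUE or FALSE per document, summing the results gives the number of documents rather than the number of co-occurrences. For a large corpus, a text-mining package would probably be faster than this pairwise comparison.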

Upvotes: 1
