Reputation: 151
I would like to count the documents in which two strings appear within a set distance of each other, say within 10 words. Take 'German*' and 'War' as an example. I do not want the total number of times they co-occur, only the number of documents in which the pair appears (a document counts once, however many times the pair occurs in it).
I know how to count documents that contain a word. But I am not sure whether I need to extract 10-grams, check whether both words appear in each one, and then count this per document, or whether there is a more efficient way.
Upvotes: 1
Views: 205
Reputation: 1778
Here is a small function that tests whether two words occur within 100 characters of each other in a text.
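isclose <- function(text) {
  limit <- 100 # Maximum distance, in characters
  # Start positions of every match; gregexpr() returns -1 when nothing matches
  match1 <- as.integer(gregexpr('war', text)[[1]])
  match2 <- as.integer(gregexpr('German', text)[[1]])
  # Without this check, the -1 "no match" value would be treated as a position
  if (match1[1] == -1 || match2[1] == -1) return(FALSE)
  # TRUE if any pair of start positions is closer than `limit` characters
  any(abs(outer(match1, match2, "-")) < limit)
}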
It works, but it measures the distance in characters. To answer the question as asked, it should be adapted to count words instead.
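A word-based variant could look like the sketch below. It is only an illustration under a few assumptions that are not in the question: the function name `within_words`, the crude tokenizer (lowercase, split on anything that is not a letter or digit), the patterns `^german` and `^war$` standing in for 'German*' and 'War', and "within 10 words" read as a difference in token positions of at most 10.

```r
# Sketch: TRUE if a token matching `pat1` lies within `window` words of a
# token matching `pat2`. Name, patterns, and tokenizer are assumptions.
within_words <- function(text, pat1 = "^german", pat2 = "^war$", window = 10) {
  # Lowercase, then split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)] # drop the empty token a leading separator leaves
  pos1 <- grep(pat1, tokens)       # token positions of the first term
  pos2 <- grep(pat2, tokens)       # token positions of the second term
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  # Compare every pair of positions; one close pair is enough
  any(abs(outer(pos1, pos2, "-")) <= window)
}

# Made-up example documents
docs <- c(
  "The German army entered the war in 1914.",
  paste("German", paste(rep("filler", 20), collapse = " "), "war"),
  "War broke out; Germans mobilised.",
  "Germans and Germany, but nothing else."
)

# Each document contributes at most 1, however often the pair occurs in it
n_docs <- sum(vapply(docs, within_words, logical(1)))
```

Because the function returns a single TRUE or FALSE per document, summing the results gives the number of documents rather than the number of co-occurrences. Comparing token positions directly also avoids building 10-grams.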
Upvotes: 1