Esperanta

Reputation: 83

How to count how many substrings match at least one element of a list, only if they are not preceded or followed by a negation?

I have a text and a list of patterns:

text="By Gregory Crawford HONG KONG, Jan 1 (Reuter) - Lower interest rates should\\ boost loan growth for Hong Kong banks in 1996, but the sluggish\\ economy will limit profit next year, analysts said.\\  \"Overall profit growth for the sector next year will not be\\ fantastic,\\\\\\\\\\\" said Alan Hutcheson at Deutsche Morgan Grenfell.\\     \\\\\\\\\\\"On the property side, we're not expecting to see any major\\ resurgence in terms of demand for mortgages,\\\\\\\\\\\" he said."
patterns=c("boost","growth","fantastic")

which I then collapsed into:

patterns.col="\\bboost\\b|\\bgrowth\\b|\\bfantastic\\b"

I want to count how many times the words in patterns appear in text, excluding the instances in which they are preceded or followed (within the previous/next 5 words) by a negation "no", "not", "don't" or "won't".

In this case, my expected output would be:

#3

that is, "boost" and "growth" x2, while "fantastic" is not counted because it is preceded by "not".

How could I do that?

Right now, I do the simple matching as follows:

library(stringr)
count=str_count(text,patterns.col)

Thanks!

Upvotes: 2

Views: 51

Answers (2)

Wiktor Stribiżew

Reputation: 626927

I suggest following this logic:

  • If a negation word is followed by 0 to 4 words (non-whitespace chunks) and then one of the keywords, discard this match and keep looking for the next match as usual, from left to right.
  • Otherwise, if a keyword is found and it is not followed by 0 to 4 words (non-whitespace chunks) and then a negation word, take it and count it.

The regex uses PCRE syntax, so you need to pass perl=TRUE to the base R regex functions. It looks like this:

\b(?:not?|[dw]on't)(?:\s+\S+){0,4}\s+(?:boost|growth|fantastic)\b(*SKIP)(*F)|\b(?:boost|growth|fantastic)\b(?!(?:\s+\S+){0,4}\s+(?:not?|[dw]on't)\b)

See the regex demo.

You do not have to hardcode it. Some parts repeat, so it makes sense to build it dynamically:

neg <- "(?:not?|[dw]on't)"
filler <- "(?:\\s+\\S+){0,4}"
keys <- "(?:boost|growth|fantastic)"
rx <- paste0("\\b", neg, filler, "\\s+", keys, "\\b(*SKIP)(*F)|\\b", keys, "\\b(?!", filler, "\\s+", neg, "\\b)")

Here, neg matches the negation words, filler matches the optional 0 to 4 intervening words, and keys matches the keywords.
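The keyword group above is still typed by hand. As a minimal sketch, the same pattern could be assembled from character vectors such as the question's patterns. The helper name build_rx and its max_gap argument are hypothetical (not part of the original answer), and the sketch assumes the keywords and negations are plain words that contain no regex metacharacters.

```r
# Hypothetical helper: assemble the SKIP/FAIL pattern from word vectors.
# Assumes every keyword and negation is a plain word (no regex escaping is done).
build_rx <- function(keywords,
                     negations = c("no", "not", "don't", "won't"),
                     max_gap = 4) {
  keys   <- paste0("(?:", paste(keywords, collapse = "|"), ")")
  neg    <- paste0("(?:", paste(negations, collapse = "|"), ")")
  filler <- paste0("(?:\\s+\\S+){0,", max_gap, "}")
  paste0("\\b", neg, filler, "\\s+", keys, "\\b(*SKIP)(*F)|",
         "\\b", keys, "\\b(?!", filler, "\\s+", neg, "\\b)")
}

rx_demo <- build_rx(c("boost", "growth", "fantastic"))
demo <- "Lower rates should boost loan growth this year but profit will not be fantastic"

# "boost" and "growth" are counted; "fantastic" is skipped because "not" precedes it
n_demo <- lengths(regmatches(demo, gregexpr(rx_demo, demo, perl = TRUE)))
n_demo
## => [1] 2
```

With max_gap = 4, a negation may sit up to five words away from the keyword, which matches the window asked for in the question.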

The regex matches:

  • \b(?:not?|[dw]on't) - word boundary + negation words (i.e. as whole words)
  • (?:\s+\S+){0,4} - zero to four sequences of 1+ whitespaces and then 1+ non-whitespaces
  • \s+ - 1+ whitespaces
  • (?:boost|growth|fantastic)\b - keywords as whole words
  • (*SKIP)(*F) - if matched, discard the match and go on looking for a match from the end of the current non-successful match
  • | - or (what we will match in the end)
  • \b(?:boost|growth|fantastic)\b - whole word match for keywords
  • (?!(?:\s+\S+){0,4}\s+(?:not?|[dw]on't)\b) - not followed by zero to four sequences of 1+ whitespaces and then 1+ non-whitespaces, then 1+ whitespaces and a negation word as a whole word.
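To see the two branches at work, here is a small sketch that runs the same pattern on three short strings. The sample strings and the names rx_check, samples and counts are made up for illustration and do not come from the question.

```r
# Same pattern as above, written out as an R string
rx_check <- paste0(
  "\\b(?:not?|[dw]on't)(?:\\s+\\S+){0,4}\\s+(?:boost|growth|fantastic)\\b(*SKIP)(*F)|",
  "\\b(?:boost|growth|fantastic)\\b(?!(?:\\s+\\S+){0,4}\\s+(?:not?|[dw]on't)\\b)"
)

# Made-up sample strings: no negation, negation before, negation after
samples <- c("strong growth ahead",
             "no real growth ahead",
             "growth is not expected")

# Number of surviving keyword matches per string
counts <- lengths(regmatches(samples, gregexpr(rx_check, samples, perl = TRUE)))
counts
## => [1] 1 0 0
```

In the second string the first alternative consumes "no real growth" and (*SKIP)(*F) throws it away. In the third string the negative lookahead rejects "growth" because "not" follows within the window.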

All you need then is to run regmatches/gregexpr:

matches <- regmatches(text, gregexpr(rx, text, perl=TRUE))
sapply(matches, length)
## => [1] 3

Upvotes: 1

d.b

Reputation: 32548

negatives = c("no", "not", "don't", "won't")

#Clean up text
x = gsub("[\\\\|,|\"|.]", "", text)
x = gsub("\\s+", " ", x)
x = unlist(strsplit(x, " "))

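ind1 = which(x %in% negatives)  # positions of negation words
ind2 = which(x %in% patterns)   # positions of keywords

# Number of keywords with a negation within 5 words on either side
# (outer() also copes with an empty or length-one index vector)
remove = sum(rowSums(abs(outer(ind2, ind1, "-")) <= 5) > 0)
add = length(ind2)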

ans = add - remove
ans
#[1] 3

Upvotes: 2
