Reputation: 83
I have a text and a list of patterns:
text="By Gregory Crawford HONG KONG, Jan 1 (Reuter) - Lower interest rates should\\ boost loan growth for Hong Kong banks in 1996, but the sluggish\\ economy will limit profit next year, analysts said.\\ \"Overall profit growth for the sector next year will not be\\ fantastic,\\\\\\\\\\\" said Alan Hutcheson at Deutsche Morgan Grenfell.\\ \\\\\\\\\\\"On the property side, we're not expecting to see any major\\ resurgence in terms of demand for mortgages,\\\\\\\\\\\" he said."
patterns=c("boost","growth","fantastic")
which I then collapsed into:
patterns.col="\\bboost\\b|\\bgrowth\\b|\\bfantastic\\b"
I want to count how many times the words in patterns appear in text, excluding the instances in which they are preceded or followed (within the previous/next 5 words) by a negation "no", "not", "don't" or "won't".
In this case, my expected output would be:
#3
that is, "boost" and "growth" x2, while "fantastic" is not counted because it is preceded by "not".
How could I do that?
Right now, I do the simple matching as follows:
count=str_count(text,patterns.col)
Thanks!
Upvotes: 2
Views: 51
Reputation: 626927
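I suggest the following logic: first match, and discard, any keyword that is preceded by a negation within the previous five words; then match the remaining keywords only if no negation follows within the next five words.
The regex is PCRE, so you need to use it with base R functions with perl=TRUE. It will look like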
\b(?:not?|[dw]on't)(?:\s+\S+){0,4}\s+(?:boost|growth|fantastic)\b(*SKIP)(*F)|\b(?:boost|growth|fantastic)\b(?!(?:\s+\S+){0,4}\s+(?:not?|[dw]on't)\b)
See the regex demo.
You do not have to hardcode it. Some parts repeat, so it makes sense to build the pattern dynamically:
neg <- "(?:not?|[dw]on't)"
filler <- "(?:\\s+\\S+){0,4}"
keys <- "(?:boost|growth|fantastic)"
rx <- paste0("\\b", neg, filler, "\\s+", keys, "\\b(*SKIP)(*F)|\\b", keys, "\\b(?!", filler, "\\s+", neg, "\\b)")
So, the neg part is the negation words, filler is the optional 0-4 words, and keys are the keywords.
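As a rough sketch, the same construction can be wrapped in a function that takes the keywords and negations as character vectors. The name count_unnegated and the window argument are hypothetical, not part of the answer above, and the sketch assumes that none of the words contain regex metacharacters.

```r
# Hypothetical helper: build the pattern from character vectors and count the matches.
# Assumes keywords and negations contain no regex metacharacters.
count_unnegated <- function(text, keywords, negations, window = 5) {
  neg    <- paste0("(?:", paste(negations, collapse = "|"), ")")
  keys   <- paste0("(?:", paste(keywords, collapse = "|"), ")")
  # up to (window - 1) words may sit between a negation and a keyword
  filler <- paste0("(?:\\s+\\S+){0,", window - 1, "}")
  rx <- paste0("\\b", neg, filler, "\\s+", keys, "\\b(*SKIP)(*F)|",
               "\\b", keys, "\\b(?!", filler, "\\s+", neg, "\\b)")
  # one count per element of text
  lengths(regmatches(text, gregexpr(rx, text, perl = TRUE)))
}

count_unnegated("profits will not be fantastic but loan growth should boost margins",
                c("boost", "growth", "fantastic"),
                c("no", "not", "don't", "won't"))
## => [1] 1
```

Here "fantastic" and "growth" both fall within five words after "not", so only "boost" is counted.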
The regex matches:
- \b(?:not?|[dw]on't) - word boundary + negation words (i.e. as whole words)
- (?:\s+\S+){0,4} - zero to four sequences of 1+ whitespaces and then 1+ non-whitespaces
- \s+ - 1+ whitespaces
- (?:boost|growth|fantastic)\b - keywords as whole words
- (*SKIP)(*F) - if matched, discard the match and go on looking for a match from the end of the current non-successful match
- | - or (what we will match in the end)
- \b(?:boost|growth|fantastic)\b - whole word match for keywords
- (?!(?:\s+\S+){0,4}\s+(?:not?|[dw]on't)\b) - not followed with zero to four sequences of 1+ whitespaces and then 1+ non-whitespaces, then 1+ whitespaces and a negation word as a whole word.
All you need then is to run regmatches/gregexpr:
matches <- regmatches(text, gregexpr(rx, text, perl=TRUE))
sapply(matches, length)
## => [1] 3
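To see what (*SKIP)(*F) does in isolation, here is a minimal toy example. The string and the simplified pattern are made up for illustration. Every "good" that directly follows "not " is consumed by the first alternative and thrown away, so only the other occurrences are returned.

```r
# Toy pattern: skip every "good" that directly follows "not ", keep the rest.
s <- "good not good good"
m <- gregexpr("not good(*SKIP)(*F)|good", s, perl = TRUE)
as.integer(m[[1]])        # start positions of the kept matches
## => [1]  1 15
regmatches(s, m)[[1]]
## => [1] "good" "good"
```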
Upvotes: 1
Reputation: 32548
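negatives = c("no", "not", "don't", "won't")
# Clean up text: drop backslashes, commas, quotes and periods, then split into words
x = gsub("[\\\\,\".]", "", text)
x = unlist(strsplit(x, "\\s+"))
# Positions of the negations and of the keywords in the word vector
ind1 = which(x %in% negatives)
ind2 = which(x %in% patterns)
# A keyword is dropped if any negation lies within 5 words of it.
# outer() always returns a matrix, so this also works with a single keyword hit.
remove = sum(rowSums(abs(outer(ind2, ind1, "-")) <= 5) > 0)
add = length(ind2)
ans = add - remove
ans
#[1] 3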
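The index-distance idea behind the code above can be checked on a short, made-up word vector. This is only an illustrative sketch: the sentence is invented, and outer() is used here to build the keyword-by-negation distance matrix.

```r
# Made-up sentence, already split into words
words <- c("profits", "will", "not", "be", "fantastic", "but",
           "loan", "growth", "should", "boost", "margins")
neg_pos <- which(words %in% c("no", "not", "don't", "won't"))   # position 3
key_pos <- which(words %in% c("boost", "growth", "fantastic"))  # positions 5, 8, 10
# Rows are keywords, columns are negations; entries are distances in words
d <- abs(outer(key_pos, neg_pos, "-"))
near <- rowSums(d <= 5) > 0   # TRUE if some negation is within 5 words
words[key_pos[!near]]         # keywords that are kept
## => [1] "boost"
```

The distances from "not" are 2, 5 and 7 words, so "fantastic" and "growth" are dropped and only "boost" remains.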
Upvotes: 2