How to remove sentences with conjunctions in R

Question

I have text, an example of which is as follows

Input

c(",At the end of the study everything was great
,There is an funny looking thing somewhere but I didn't look at it too hard
Some other sentence
 The test ended.",",Not sure how to get this regex sorted
I don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning
How do I do this")

The expected output is

,At the end of the study everything was great
,Some other sentence
The test ended.
,Not sure how to get this regex sorted

How do I do this

I tried:

  x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])

but it removed the whole text. How do I remove only the sentence with 'but' in it and keep the rest of the sentences in each string?

Wiktor Stribiżew · Accepted Answer

You may use

x <- c(",At the end of the study everything was great
,There is an funny looking thing somewhere but I didn't look at it too hard
Some other sentence
 The test ended.", ",Not sure how to get this regex sorted
I don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning
How do I do this")
# PCRE engine (perl=TRUE)
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
# TRE (default) engine with the (?n) inline modifier
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)

See the R demo online

The PCRE pattern matches:

.* - 0 or more chars other than line break chars, as many as possible
\bbut\b - the whole word but (\b is a word boundary; it is written \\b inside an R string literal)
.* - 0 or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars (CR or LF).
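
As a quick check of the pattern breakdown above, here is a minimal sketch on a made-up input. The vector `demo` and its contents are assumptions for illustration and are not taken from the question. It uses the PCRE variant with `perl = TRUE`.

```r
# Made-up input for illustration: the first string has three lines, one of
# which contains the word "But"; the second only contains "but" inside a
# longer word, so the word boundaries should prevent a match.
demo <- c("keep this\ndrop this But only this line\nkeep this too",
          "peanut butter stays")

# Same pattern as in the answer; backslashes are doubled in the R string.
cleaned <- gsub(".*\\bbut\\b.*[\r\n]*", "", demo,
                ignore.case = TRUE, perl = TRUE)
print(cleaned)
```

Because `[\r\n]*` only consumes the line breaks that follow the matched line, a matching line at the very end of a string leaves the line break before it in place.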

Note that the first gsub call passes perl=TRUE, which makes R use the PCRE regex engine, where . does not match line break chars by default. The second gsub call uses the default TRE regex engine, where . does match line break chars, so the (?n) inline modifier is needed to stop it from doing so.
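
If the engine difference is a concern, one alternative (not the approach used in the answer above) is to split each string into lines, drop the lines that contain the word, and join the rest back together. A minimal sketch, assuming a hypothetical helper name `drop_but_lines` and a made-up input:

```r
# Hypothetical helper: drop every line that contains the whole word "but".
drop_but_lines <- function(strings) {
  vapply(strings, function(s) {
    # Split on LF, optionally preceded by CR.
    lines <- strsplit(s, "\r?\n")[[1]]
    # Keep only the lines without the whole word "but" (case-insensitive).
    keep <- !grepl("\\bbut\\b", lines, ignore.case = TRUE, perl = TRUE)
    paste(lines[keep], collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up input for illustration.
sample_text <- c("first line\nsecond line but not wanted\nthird line",
                 "no conjunction here")
result <- drop_but_lines(sample_text)
print(result)
```

Since the pattern is applied to one line at a time, it no longer matters whether . matches line break chars. The trade-off is that the rejoined text always uses "\n" as the separator, whatever line break style the input used.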
