Sebastian Zeki

Reputation: 6874

How to remove sentences with conjunctions in R

I have text, an example of which is as follows

Input

c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

The expected output is

,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this

I tried:

  x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])

but it removed the whole sentence. How do I remove only the phrase containing 'but' and keep the other phrases in each sentence?

Upvotes: 1

Views: 46

Answers (2)

Wiktor Stribiżew

Reputation: 626747

You may use

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)

See the R demo online

The PCRE pattern matches:

  • .* - 0 or more chars other than line break chars, as many as possible
  • \\bbut\\b - the whole word but (\b is a word boundary)
  • .* - 0 or more chars other than line break chars, as many as possible
  • [\r\n]* - 0 or more line break chars.
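A small sketch of what the word boundaries buy you. The sample strings below are made up for illustration and are not from the question:

```r
# Hypothetical sample lines (not from the question)
samples <- c("I like butter", "slow but steady", "BUT why")

# \\b on both sides requires "but" to be a whole word
hits <- grepl("\\bbut\\b", samples, ignore.case = TRUE, perl = TRUE)
print(hits)  # FALSE TRUE TRUE

# Applied to a multi-line string, only the line with the whole word "but" goes away
txt <- "keep this\ndrop this but not that\nkeep butter"
cleaned <- gsub(".*\\bbut\\b.*[\r\n]*", "", txt, ignore.case = TRUE, perl = TRUE)
print(cleaned)  # "keep this\nkeep butter"
```

Note that the trailing [\r\n]* also consumes the line break after the removed line, so no empty line is left behind.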

Note that the first gsub has a perl=TRUE argument, which makes R use the PCRE regex engine to parse the pattern; there, . does not match a line break char. The second gsub uses the default TRE regex engine, where . does match line break chars, so the (?n) inline modifier is needed to stop . from matching them.
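The engine difference can be checked on a tiny string. This is only an illustrative sketch; the test string s is made up:

```r
# Hypothetical two-line test string
s <- "a\nb"

# PCRE (perl = TRUE): . skips the line break, so only "\n" survives
pcre_out <- gsub(".", "", s, perl = TRUE)
print(nchar(pcre_out))  # 1

# TRE (default): . also matches the line break, so nothing survives
tre_out <- gsub(".", "", s)
print(nchar(tre_out))   # 0

# TRE with the (?n) inline modifier described above
tre_n_out <- gsub("(?n).", "", s)
print(nchar(tre_n_out))
```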

Upvotes: 1

Patrick Roocks

Reputation: 3259

Note that you mixed up "\n" and "/n", which I corrected.

My idea for a solution:

1) Simply match all chars that are not a line break ([^\n]) before and after the "but".

2) (Edit) To address the issue Wiktor found, we also have to check that a non-letter char ([^a-zA-Z]) is directly before and after the "but", so that words merely containing "but" are not matched.

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
       ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this" 
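One thing to keep in mind with this pattern: [^a-zA-Z] must consume an actual character, so a "but" at the very start or end of the string is not caught, and the match is case-sensitive. A quick sketch on made-up samples (not from the question), compared with the word-boundary pattern from the other answer:

```r
# Hypothetical one-line samples (not from the question)
samples <- c("slow but steady", "but first line", "ends with but",
             "But capitalised", "I like butter")

# Character-class pattern: needs a non-letter char on both sides of "but"
class_hits <- grepl("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", samples)
print(class_hits)     # TRUE FALSE FALSE FALSE FALSE

# Word-boundary pattern: also matches at string edges, here case-insensitively
boundary_hits <- grepl("\\bbut\\b", samples, ignore.case = TRUE, perl = TRUE)
print(boundary_hits)  # TRUE TRUE TRUE TRUE FALSE
```

For the example data in the question both patterns behave the same, since every "but" there sits between spaces.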

Upvotes: 1
