Reputation: 6874
I have text such as the following example.
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it removed the whole sentence. How do I remove only the phrase with 'but' in it and keep the rest of the phrases in each sentence?
Upvotes: 1
Views: 46
Reputation: 626747
You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
The PCRE pattern matches:
- .* - any 0 or more chars other than line break chars, as many as possible
- \\bbut\\b - the whole word but (\b is a word boundary)
- .* - any 0 or more chars other than line break chars, as many as possible
- [\r\n]* - 0 or more line break chars.
Note that the first gsub call has the perl=TRUE argument, which makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub call uses the TRE (default) regex engine, where one needs the (?n) inline modifier to make . fail to match line break chars.
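As a minimal sketch of the engine difference described above, the following runs the same pattern three ways in base R. The string s is a hypothetical two-"but"-free-lines example made up for illustration, not taken from the question.

```r
# Hypothetical three-line input, used only to illustrate the engine difference.
s <- "keep this\nnice but long\nkeep that"

# PCRE (perl = TRUE): "." stops at line breaks, so only the "but" line is removed.
pcre <- gsub(".*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE, perl = TRUE)

# TRE (default) without (?n): "." also matches "\n", so the greedy ".*"
# runs across lines and the match covers the whole string.
tre_plain <- gsub(".*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE)

# TRE with the (?n) inline modifier, as in the second call of the answer.
tre_n <- gsub("(?n).*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE)

print(pcre)       # "keep this\nkeep that"
print(tre_plain)  # ""
print(tre_n)
```

Comparing tre_n with pcre on your own data is a quick way to check that both calls behave the same before choosing one.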
Upvotes: 1
Reputation: 3259
Note that you mixed up "\n" and "/n", which I corrected.
My idea for a solution:
1) Simply match all chars that are not a line break ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that a non-letter char ([^a-zA-Z]) comes directly before and after the "but", so that "but" inside a longer word does not match.
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"
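A small sketch of what the [^a-zA-Z] checks change, assuming the issue in question is "but" occurring inside a longer word. The input s and the names loose and strict are hypothetical and only serve the illustration.

```r
# Hypothetical input: the second line contains "butter", not the word "but".
s <- "first line\nbread and butter\nlast line"

# Without the non-letter checks, the "but" inside "butter" matches,
# so the second line is removed by mistake.
loose <- gsub("[^\n]*but[^\n]*", "", s)

# With a non-letter required on both sides, "butter" is left alone.
strict <- gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", s)

print(loose)   # "first line\n\nlast line"
print(strict)  # "first line\nbread and butter\nlast line"
```

Because this pattern needs an actual character on each side of "but", it would not match a "but" at the very start or end of a string, and it is case-sensitive, so a leading "But" is also skipped.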
Upvotes: 1