text cleaning gsub remove everything until newline is found

Question

have <- ('Good luck!!!

          ___________________
          Disclaimer: This email, including attachment ....
          .............
          
          Great!!!   
         ')

want <- ('Good luck!!!
          Great!!!
         ')

I'm dealing with an email-like dataset that I want to clean before further analysis. It contains some constant structures, such as a Disclaimer section that is preceded and followed by a newline, which I think should be possible to remove with a regex. However, the length of the disclaimer may vary due to truncation.

This is what I have tried so far:

gsub(pattern = 'Disclaimer([\\s\\S]*)[\n|\r\n|\r]{2}', replacement = '', have)

Wiktor Stribiżew · Accepted Answer

You can use

have <- trimws(gsub("(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*", "", have, perl=TRUE))

See the R demo. Here is a regex demo.

Details:

(?m) - multiline mode on (^ matches at the start of every line)
^ - start of a line
\s* - any zero or more whitespace chars
_{3,} - three or more _s
\R - a line break
\h* - zero or more horizontal whitespaces
Disclaimer: - the literal text Disclaimer:
.* - the rest of the line
(?:\R.*\S.*)*+ - zero or more non-blank lines
\s* - any zero or more whitespace chars.
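
The breakdown above can be checked with a small self-contained sketch. The first input mirrors the sample from the question; the second one (a one-line disclaimer) is a made-up case added here only to illustrate that the pattern does not depend on the disclaimer's length. Note that inside an R string literal every regex backslash has to be doubled.

```r
# First element: the sample from the question, built line by line.
# Second element: a hypothetical shorter disclaimer (an assumption for illustration).
have <- c(
  paste(
    "Good luck!!!",
    "",
    "          ___________________",
    "          Disclaimer: This email, including attachment ....",
    "          .............",
    "          ",
    "          Great!!!   ",
    "         ",
    sep = "\n"
  ),
  "Hi\n\n___\nDisclaimer: short\n\nBye"
)

# The regex from the answer, with backslashes doubled for the R string literal.
pattern <- "(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*"

# perl = TRUE is needed for (?m), \R, \h and the possessive quantifier *+.
# gsub() is vectorized, so both strings are cleaned in one call.
cleaned <- trimws(gsub(pattern, "", have, perl = TRUE))

# writeLines() prints the embedded newlines as real line breaks.
writeLines(cleaned)
```

The match starts at the blank line before the underscores and ends just before the next non-blank text, so the line break after the preceding paragraph is kept and trimws() only has to strip the leading and trailing whitespace of the whole string.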
