Afiq Johari

Reputation: 1462

text cleaning gsub remove everything until newline is found

have <- ('Good luck!!!

          ___________________
          Disclaimer: This email, including attachment ....
          .............
          
          Great!!!   
         ')

want <- ('Good luck!!!
          Great!!!
         ')

I'm dealing with an email-like dataset that I want to clean before further analysis. It contains some constant structures, such as a Disclaimer section that is preceded and followed by a newline, so I think removing it should be possible with a regex. However, the length of the disclaimer may vary due to truncation.

Here is what I've tried so far:

gsub(pattern = 'Disclaimer([\\s\\S]*)[\\n|\\r\\n|\\r]{2}', replacement = '', have)


Answers (1)

Wiktor Stribiżew

Reputation: 627044

You can use

have <- trimws(gsub("(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*", "", have, perl=TRUE))

See the R demo. Here is a regex demo.

Details:

  • (?m) - multiline mode on (^ matches at the start of every line)
  • ^ - start of a line
  • \s* - any zero or more whitespace chars
  • _{3,} - three or more _s
  • \R - a line break
  • \h* - zero or more horizontal whitespaces
  • Disclaimer: - a text
  • .* - the rest of the line
  • (?:\R.*\S.*)*+ - zero or more non-blank lines
  • \s* - any zero or more whitespace chars.
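
To see how the pieces above fit together, here is a minimal sketch that applies the same pattern to the sample from the question. The wrapper name `strip_disclaimer` is hypothetical and only used for illustration, and the input is rebuilt with `paste()` so that every line break is an explicit `\n`.

```r
# Hypothetical helper wrapping the gsub() call shown above.
strip_disclaimer <- function(x) {
  # Underscore rule line, "Disclaimer:" line, then any non-blank lines
  pattern <- "(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*"
  trimws(gsub(pattern, "", x, perl = TRUE))
}

# Sample input from the question, one element per line
have <- paste(
  "Good luck!!!",
  "",
  "          ___________________",
  "          Disclaimer: This email, including attachment ....",
  "          .............",
  "          ",
  "          Great!!!   ",
  "         ",
  sep = "\n"
)

cleaned <- strip_disclaimer(have)
cat(cleaned, "\n")
```

Note that the trailing `\s*` also consumes the indentation before `Great!!!`, so the result is `"Good luck!!!\nGreat!!!"` without the leading spaces on the second line. The disclaimer is considered finished at the first blank or whitespace-only line, so a disclaimer that is not followed by such a line would presumably swallow the text after it as well.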

