Reputation: 1462
have <- ('Good luck!!!
___________________
Disclaimer: This email, including attachment ....
.............
Great!!!
')
want <- ('Good luck!!!
Great!!!
')
I'm dealing with an email-like dataset that I want to clean before further analysis.
There are some constant structures, such as a Disclaimer section that is preceded and followed by a newline, so I think it should be possible to remove it with a regex. However, the length of the disclaimer may vary due to truncation.
What I've tried so far is below:
gsub(pattern = 'Disclaimer([\\s\\S]*)[\\n|\\r\\n|\\r]{2}', replacement = '', have)
Upvotes: 1
Views: 225
Reputation: 627044
You can use
have <- trimws(gsub("(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*", "", have, perl=TRUE))
See the R demo. Here is a regex demo.
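As a quick sanity check, here is a minimal sketch that applies the same pattern to a sample string. The sample assumes that blank lines surround the disclaimer block, as the question describes; the exact line breaks in the real data are an assumption, and `strip_disclaimer` is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper wrapping the gsub() call above; gsub() is vectorised,
# so it also works on a character vector of messages.
strip_disclaimer <- function(x) {
  pattern <- "(?m)^\\s*_{3,}\\R\\h*Disclaimer:.*(?:\\R.*\\S.*)*+\\s*"
  trimws(gsub(pattern, "", x, perl = TRUE))
}

# Assumed sample: a blank line before the underscore rule and after the disclaimer.
have <- paste0(
  "Good luck!!!\n\n",
  "___________________\n",
  "Disclaimer: This email, including attachment ....\n",
  ".............\n\n",
  "Great!!!\n"
)

cleaned <- strip_disclaimer(have)
cat(cleaned, "\n")

# The disclaimer block is gone and the surrounding text is kept.
stopifnot(identical(cleaned, "Good luck!!!\nGreat!!!"))

# A disclaimer truncated mid-sentence at the end of the message is removed too.
stopifnot(identical(strip_disclaimer("Hi\n\n_____\nDisclaimer: This em"), "Hi"))
```

Note that the removal stops at the first blank line after `Disclaimer:`. If the text that should be kept follows the disclaimer without a blank line in between, it will be removed as well.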
Details:
(?m) - multiline mode on
^ - start of a line
\s* - any zero or more whitespace chars
_{3,} - three or more _ chars
\R - a line break
\h* - zero or more horizontal whitespaces
Disclaimer: - a literal text
.* - the rest of the line
(?:\R.*\S.*)*+ - zero or more non-blank lines
\s* - any zero or more whitespace chars.
Upvotes: 1