Reputation: 173
I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:
I want to remove all sentences that start with "Collected" but not any of the sentences that follow. The number of sentences that follow varies, e.g. from 0 to 4. I tried removing every combination of Collected + (last word of the sentence), but there are too many combinations. Removing Collected + [.] removes all the subsequent sentences. Does anyone have any suggestions? Thank you in advance.
Upvotes: 1
Views: 573
Reputation: 20085
One option is to use `gsub`:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : starts with `Collected`
- `[^.]*` : followed by any number of characters other than `.`
- `\\. ` : ends with `.` and a space

Such matches are replaced with `""`.
Data:
df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
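Because the pattern requires a space after the period, a note that consists only of the `Collected` sentence (zero sentences following) would be left unchanged. A minimal sketch of one way to cover that case, assuming the sentence ends with a period followed either by whitespace or by the end of the string; the `notes` vector is made-up example data, not taken from the question:

```r
# Hypothetical example data: the second note has nothing after the "Collected" sentence.
notes <- c("Collected for 2 man-hours total. Cloudy, imminent storms.",
           "Collected for 2 man-hours total.")

# \\s* makes the trailing whitespace optional, so the period may also end the string.
cleaned <- sub("^Collected[^.]*\\.\\s*", "", notes)
cleaned
```

With this variant the second note becomes an empty string rather than keeping its `Collected` sentence.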
Upvotes: 5
Reputation: 5169
a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
> [1] "Sunny."
Or if you know that there will be a space after the period:
sub("Collected.*?\\. ","",a)
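Since `sub` is vectorized, the same call can be applied to a whole column. The sketch below uses the lazy pattern with the trailing space on a character vector; `notes` is hypothetical data, and the `^` anchor is an addition that restricts the match to notes beginning with `Collected`. Requiring the space after the period also appears to help when the sentence contains a decimal such as `1.5`, because the match only stops at a period that is followed by a space.

```r
# Hypothetical example data; the second note contains a decimal number.
notes <- c("Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.",
           "Collected 1.5 hours total. Rainy.")

# The lazy .*? stops at the first period that is followed by a space.
cleaned2 <- sub("^Collected.*?\\. ", "", notes)
cleaned2
```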
Upvotes: 4