Reputation: 173

Removing all sentences that begin with a specific word

I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:

Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
...and so on.

I want to remove every sentence that starts with "Collected" but keep the sentences that follow it. The number of sentences that follow varies, e.g. from 0 to 4. I tried removing every combination of Collected + (last word of the sentence), but there are too many combinations. Removing Collected + [.] also removes all the subsequent sentences. Does anyone have any suggestions? Thank you in advance.

Upvotes: 1

Answers (2)

MKR

Reputation: 20085

One option is to use `gsub`:

gsub("^Collected[^.]*\\. ","",df$Notes)

# [1] "Cloudy, imminent storms."
# [2] "Rainy."                  
# [3] "Sunny."

Regex explanation:

 - `^Collected`    : The string starts with `Collected`
 - `[^.]*`         : Followed by zero or more characters other than `.`
 - `\\. `          : Then a literal `.` followed by a space.

Each match is replaced with the empty string "".

Data:

df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
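
The pattern above requires a space after the period and stops at the first `.` it sees. The question mentions that 0 sentences may follow, and a note could plausibly contain a decimal such as "1.5", so here is a minimal sketch of a variant that tolerates both. The `notes` vector is hypothetical sample data (the last two entries are not from the original dataset), and `perl = TRUE` is used so that the lazy quantifier is handled by PCRE.

```r
# Hypothetical sample notes; entries 2 and 3 are edge cases not in the original data.
notes <- c(
  "Collected for 2 man-hours total. Cloudy, imminent storms.",
  "Collected for 2 man-hours total.",
  "Collected for 1.5 man-hours total. Rainy."
)

# Match lazily from "Collected" at the start of the string up to the first
# period that is followed by whitespace or by the end of the string.
pattern <- "^Collected.*?\\.(\\s+|$)"

cleaned <- sub(pattern, "", notes, perl = TRUE)
print(cleaned)
# Elements: "Cloudy, imminent storms.", "" and "Rainy."
```

A note that consists only of the "Collected" sentence becomes an empty string here; whether that should instead be `NA` depends on how the column is used downstream.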

Upvotes: 5

Alexey Ferapontov

Reputation: 5169

a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))

> [1] "Sunny."

Or if you know that there will be a space after the period:

sub("Collected.*?\\. ","",a)
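
If a "Collected" sentence can also appear in the middle or at the end of a note, the same idea can be extended with `gsub`. The following is only a sketch under that assumption: the `notes` vector is made up for illustration, and the lookbehind `(?<=[.!?]\\s)` assumes sentences are separated by a single whitespace character after `.`, `!` or `?`.

```r
# Hypothetical notes where the "Collected" sentence is not always first.
notes <- c(
  "Sunny. Collected for 2 man-hours total. Windy.",
  "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.",
  "Rainy. Collected for 2 man-hours total."
)

# "Collected" must sit at the start of the string or right after a sentence
# terminator plus one whitespace character; the match then runs lazily to the
# first period followed by whitespace or the end of the string.
pattern <- "(?:^|(?<=[.!?]\\s))Collected.*?\\.(?:\\s+|$)"

# gsub removes every such sentence; trimws drops a leftover trailing space.
cleaned <- trimws(gsub(pattern, "", notes, perl = TRUE))
print(cleaned)
# Elements: "Sunny. Windy.", "Sunny." and "Rainy."
```

The lookbehind keeps a word like "Collected" in the middle of another sentence from being treated as a sentence start.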

Upvotes: 4
