Reputation: 1613

How to remove texts after some sentences?

I have a data frame with n rows of text. Some of these rows contain extra text that I would like to remove; the extra text always shows up after one of a few specific sentences.

Let me take an example:

df = structure(list(Text = c("The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this. We are now ready to take your questions. Life is great even if it is too hot to work at the moment.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this. We are now at your disposal for questions. I really need to remove this bit that comes after since I don't need it. Hopefully SE will sort this out.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this. Transcript of the questions asked and the answers. Summertime is nice.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this."
)), class = "data.frame", row.names = c(NA, -12L))

I would like to get:

#                                                               Text
# 1                                                     The text you see here is fine, no problem with this.
# 2                                                     The text you see here is fine, no problem with this.
# 3            The text you see here is fine, no problem with this. We are now ready to take your questions.
# 4                                                     The text you see here is fine, no problem with this.
# 5                                                     The text you see here is fine, no problem with this.
# 6          The text you see here is fine, no problem with this. We are now at your disposal for questions.
# 7                                                     The text you see here is fine, no problem with this.
# 8                                                     The text you see here is fine, no problem with this.
# 9                                                     The text you see here is fine, no problem with this.
# 10 The text you see here is fine, no problem with this. Transcript of the questions asked and the answers.
# 11                                                    The text you see here is fine, no problem with this.
# 12                                                    The text you see here is fine, no problem with this.

The data frame is a simplified representation of the real one. The extra text (which is always the same in the example but varies in the real one) always comes after one of three sentences: "We are now at your disposal for questions.", "Transcript of the questions asked and the answers." and "We are now ready to take your questions."

Can anyone help me sort this out?

You would really make my day.

Thanks!

Upvotes: 2

Answers (3)

akrun

Reputation: 887158

We can use sub to drop everything from a known phrase onward:

df$Text <- sub("I really need to remove .*", "", df$Text)

To keep the text up to and including each of several sentences, we could create a pattern vector and use a for loop:

patvec <- c("We are now at your disposal for questions.",
            "Transcript of the questions asked and the answers.",
            "We are now ready to take your questions.",
            "I really need to remove this bit that comes after since I don't need it.")

# loop over the sequence of the pattern vector
for (i in seq_along(patvec)) {
  # build a regex that captures everything up to and including
  # the current pattern vector element
  tmppat <- paste0("^(.*", patvec[i], ").*")
  # replace the whole string with the captured group, i.e. the part inside (...),
  # and update the column Text
  df$Text <- sub(tmppat, "\\1", df$Text)
}

Output:

df
                                                                                                      #Text
#1                                                     The text you see here is fine, no problem with this.
#2                                                     The text you see here is fine, no problem with this.
#3            The text you see here is fine, no problem with this. We are now ready to take your questions.
#4                                                     The text you see here is fine, no problem with this.
#5                                                     The text you see here is fine, no problem with this.
#6          The text you see here is fine, no problem with this. We are now at your disposal for questions.
#7                                                     The text you see here is fine, no problem with this.
#8                                                     The text you see here is fine, no problem with this.
#9                                                     The text you see here is fine, no problem with this.
#10 The text you see here is fine, no problem with this. Transcript of the questions asked and the answers.
#11                                                    The text you see here is fine, no problem with this.
#12                                                    The text you see here is fine, no problem with this.

NOTE: Because each pattern is applied in its own sub call, this should work fine even if there are hundreds of thousands of pattern vector elements.
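The loop above pastes each sentence into a regex, so a "." in a sentence matches any character. A minimal sketch of a variant that treats each sentence as a literal string instead, using base R's regexpr with fixed = TRUE and substr. The helper name truncate_after and the sample vector texts are made up for illustration and are not part of the answer. Unlike the greedy ".*" above, this cuts after the first occurrence of a sentence, not the last.

```r
# Illustrative data (hypothetical, shorter than the df in the question)
texts <- c(
  "All fine here.",
  "All fine here. We are now ready to take your questions. Extra text to drop.",
  "All fine here. Transcript of the questions asked and the answers. More extra text."
)

patvec <- c("We are now at your disposal for questions.",
            "Transcript of the questions asked and the answers.",
            "We are now ready to take your questions.")

# Hypothetical helper: keep everything up to and including the first
# literal occurrence of each marker sentence
truncate_after <- function(x, markers) {
  for (m in markers) {
    # fixed = TRUE matches m literally; the result is -1 where m is absent
    start <- regexpr(m, x, fixed = TRUE)
    hit <- !is.na(start) & start > 0
    if (any(hit)) {
      x[hit] <- substr(x[hit], 1, start[hit] + nchar(m) - 1)
    }
  }
  x
}

result <- truncate_after(texts, patvec)
result
```

With the data frame from the question, the same helper could be applied as df$Text <- truncate_after(df$Text, patvec).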

Upvotes: 1

Darren Tsai

Reputation: 35554

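You can use a lookbehind, i.e. the syntax "(?<=a|b|c)text", in Perl-compatible regular expressions (perl = TRUE) to match what you want to remove. Here "text" is ".*", so everything that follows one of the sentences is matched.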

patvec <- c("We are now at your disposal for questions.", 
            "Transcript of the questions asked and the answers.", 
            "We are now ready to take your questions.",
            "I really need to remove this bit that comes after since I don't need it.")

regex <- sprintf("(?<=%s).*", paste(patvec, collapse = "|"))
sub(regex, "", df$Text, perl = TRUE)

#  [1] "The text you see here is fine, no problem with this."                                                   
#  [2] "The text you see here is fine, no problem with this."                                                   
#  [3] "The text you see here is fine, no problem with this. We are now ready to take your questions."          
#  [4] "The text you see here is fine, no problem with this."                                                   
#  [5] "The text you see here is fine, no problem with this."                                                   
#  [6] "The text you see here is fine, no problem with this. We are now at your disposal for questions."        
#  [7] "The text you see here is fine, no problem with this."                                                   
#  [8] "The text you see here is fine, no problem with this."                                                   
#  [9] "The text you see here is fine, no problem with this."                                                   
# [10] "The text you see here is fine, no problem with this. Transcript of the questions asked and the answers."
# [11] "The text you see here is fine, no problem with this."                                                   
# [12] "The text you see here is fine, no problem with this."
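One thing to watch with the lookbehind approach is that the "." at the end of each sentence is still a regex metacharacter. A minimal sketch, assuming the sentences contain no literal "\E", that wraps each sentence in PCRE's \Q...\E quoting so it is matched literally, and assigns the result instead of only printing it. The names quoted, rx, x and cleaned are made up for illustration.

```r
patvec <- c("We are now at your disposal for questions.",
            "Transcript of the questions asked and the answers.",
            "We are now ready to take your questions.")

# Quote each sentence so that "." and other metacharacters are literal
# (assumption: no sentence contains the sequence \E)
quoted <- paste0("\\Q", patvec, "\\E")
rx <- sprintf("(?<=%s).*", paste(quoted, collapse = "|"))

# Illustrative input (hypothetical)
x <- c("Fine. We are now ready to take your questions. Extra.",
       "Fine.",
       "Fine. We are now ready to take your questionsX Extra.")

# perl = TRUE is required for both the lookbehind and \Q...\E
cleaned <- sub(rx, "", x, perl = TRUE)
cleaned
```

The third element is left untouched here because "questionsX" no longer matches once the final "." is literal. For the data frame in the question, the equivalent call would be df$Text <- sub(rx, "", df$Text, perl = TRUE).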

Upvotes: 1

Mike V

Reputation: 1364

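You can try this one with dplyr and stringr. Note that distinct() also drops duplicate rows, so the result has fewer rows than df:

library(dplyr)
library(stringr)

df2 <- df %>% 
  distinct(Text) %>% 
  mutate(Text = str_replace_all(Text, regex("I really need to .*"), ""))
df2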
# Text
# 1                                                     The text you see here is fine, no problem with this.
# 2           The text you see here is fine, no problem with this. We are now ready to take your questions. 
# 3         The text you see here is fine, no problem with this. We are now at your disposal for questions. 
# 4 The text you see here is fine, no problem with this. Transcript of the questions asked and the answers.

Upvotes: 1
