code_learner

Reputation: 223

How to remove a pattern of text ending with a colon in R?

I have the following text:

review <- c("1a. How long did it take for you to receive a personalized response to an internet or email inquiry made to THIS dealership?: Approx. It was very prompt however. 2f. Consideration of your time and responsiveness to your requests.: Were a little bit pushy but excellent otherwise 2g. Your satisfaction with the process of coming to an agreement on pricing.: Were willing to try to bring the price to a level that was acceptable to me. Please provide any additional comments regarding your recent sales experience.: Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! ")

I want to remove every segment that ends with a colon (the question text) and keep only the answers.

I tried the following code:

gsub("^[^:]+:","",review)

However, it only removed the first sentence ending with a colon.

Expected results:

Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me. Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)!

Any help or suggestions will be appreciated. Thank you.

Upvotes: 1

Views: 143

Answers (1)

Wiktor Stribiżew

Reputation: 626689

If the sentences are simple and contain no abbreviations, you may use

gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)

See the regex demo.

Note that you may generalize it a bit further by changing \\d+[a-zA-Z] to [0-9a-zA-Z]+ or [[:alnum:]]+, so that the label can be any run of one or more digits or letters.
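A small sketch of what the broader label changes, and of one trade-off that may be worth checking against your own data. The sample strings and variable names are made up for illustration, and perl = TRUE is an assumption on my part so that the pattern is interpreted by PCRE; the call above uses the default engine.

```r
# Hypothetical sample strings, not taken from the question.
labelled   <- "Q12. Ease of scheduling.: Very easy. 3b. Wait time?: Short."
unlabelled <- "Acceptable to me. Any other comments.: Great!"

narrow <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
broad  <- "(?:[[:alnum:]]+\\.)?[^.?!:]*[?!.]:\\s*"

# "Q12." is not "digits then one letter", so only the broad pattern
# treats it as a label and removes it together with the question.
narrow_labelled <- gsub(narrow, "", labelled, perl = TRUE)
broad_labelled  <- gsub(broad,  "", labelled, perl = TRUE)

# Trade-off: when a question has no label, the broad pattern can take the
# last word of the previous answer ("me.") as if it were the label.
narrow_unlabelled <- gsub(narrow, "", unlabelled, perl = TRUE)
broad_unlabelled  <- gsub(broad,  "", unlabelled, perl = TRUE)

print(c(narrow_labelled, broad_labelled, narrow_unlabelled, broad_unlabelled))
```

If your data mixes labelled and unlabelled questions, the narrower label is probably the safer default.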

Details

  • (?:\d+[a-zA-Z]\.)? - an optional sequence of
    • \d+ - 1+ digits
    • [a-zA-Z] - an ASCII letter
    • \. - a dot
  • [^.?!:]* - 0 or more chars other than ., ?, !, :
  • [?!.] - a ?, ! or .
  • : - a colon
  • \s* - 0+ whitespaces
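The breakdown above can also be written out as code by assembling the pattern from its parts. This is only a sketch: the variable names and the sample string are hypothetical, and perl = TRUE is my assumption so that the syntax is handled by PCRE.

```r
# Each piece corresponds to one item in the breakdown above.
label      <- "(?:\\d+[a-zA-Z]\\.)?"  # optional label such as "1a."
body       <- "[^.?!:]*"              # question text: no ., ?, ! or :
terminator <- "[?!.]"                 # sentence-ending punctuation
colon      <- ":"                     # the colon that closes the question
trailing   <- "\\s*"                  # whitespace before the answer

pattern <- paste0(label, body, terminator, colon, trailing)

# Hypothetical sample string, shorter than the one in the question.
x <- "1a. How fast was the reply?: Very prompt. 2f. Respect for your time.: A bit pushy."
result <- gsub(pattern, "", x, perl = TRUE)
print(result)
```

Keeping the pieces separate makes it easier to swap one of them out, for example the label, without retyping the whole expression.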

R test:

> gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
[1] "Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me.Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! "
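In the output above, "me.Abel" has lost its separating space: the unlabelled question is matched starting from the space in front of it, so that space is removed too. One possible adjustment, which is my own variant and not part of the pattern above, is to also consume leading whitespace, replace each match with a single space, and trim the ends. The sample strings and names are hypothetical, and perl = TRUE is an assumption.

```r
original <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
spaced   <- paste0("\\s*", original)  # also swallow whitespace before the question

# Hypothetical sample strings.
x1 <- "Fine so far. Any other comments.: Great service!"
x2 <- "1a. Speed of reply?: Fast."

# Original pattern: the answers end up glued together.
glued <- gsub(original, "", x1, perl = TRUE)

# Variant: put one space where each question was, then trim the ends
# (a question at the very start would otherwise leave a leading space).
fixed1 <- trimws(gsub(spaced, " ", x1, perl = TRUE))
fixed2 <- trimws(gsub(spaced, " ", x2, perl = TRUE))

print(c(glued, fixed1, fixed2))
```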

Extending to handle abbreviations

You may enumerate the exceptions if you add alternation:

gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", review)     
                          ^^^^^^^^^^^^^^^^^^^^^^ 

Here, (?:i\.?e\.|[^.?!:])* matches 0 or more ie. or i.e. substrings or any chars other than ., ?, ! or :.
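To illustrate the difference, here is a short comparison on a question that contains "i.e.". The sample string and variable names are made up, and perl = TRUE is my assumption so that the pattern is interpreted by PCRE.

```r
base     <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
extended <- "(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*"

# Hypothetical sample: the question itself contains the abbreviation "i.e.".
x <- "2f. Responsiveness, i.e. speed of reply.: Were a little pushy."

# Base pattern: the dots in "i.e." end the question body too early,
# so only " speed of reply.: " is removed.
base_result <- gsub(base, "", x, perl = TRUE)

# Extended pattern: "i.e." is consumed as a unit, so the whole
# question up to the colon is removed.
extended_result <- gsub(extended, "", x, perl = TRUE)

print(c(base_result, extended_result))
```

Further abbreviations can be added as extra alternatives in front of [^.?!:], in the same way as i\.?e\. here.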

See this demo.

Upvotes: 2
