user3245256

Reputation: 1948

Remove several strings between two specific characters

I need help with regex in R. I have a bunch of strings each of which has a structure similar to this one:

mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes.  Thank you very much for completing this\": ME.' 'You!' sai"

Notice that this string contains substrings within "" followed by a ":" and some text without quotation marks, until we encounter a "|"; then a new quotation mark appears, and so on.

Notice also that at the very end there is text after a ":" - but at the VERY end there is no "|"

My objective is to remove all text starting at each ":" (including the ":" itself) up to the next "|", while keeping the "|". I also need to remove everything from the very last ":" to the end of the string.

Finally (that's more of a bonus) - I want to get rid of all "\" characters and all quotation marks - because in the final solution I need to have "clean text": A bunch of strings separated only by "|" characters.

Is it possible?

Here is my awkward first attempt:

gsub('\\:.*?\\|', '', mytext)

Upvotes: 0

Views: 1189

Answers (2)

Jota

Reputation: 17611

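With a single gsub you can match a : and all the text after it up to, but not including, the next pipe: :[^|]*. Because [^|]* also runs to the end of the string when no pipe follows, this covers the last case too. To remove the double quotes in the same call, add a second pattern after the alternation character (|): [\"]. The backslashes in mytext are only escapes for the quotes in the R source, so the string itself contains no backslashes to remove.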

gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"
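To make the behaviour of the pattern easier to see, here is a minimal sketch on a shorter input. The string x and the variable names are made up for illustration; only the regular expression comes from the answer above.

```r
# Hypothetical short input with the same structure as mytext
x <- "\"keep one\": drop this|\"keep two\": drop that too"

# \" in the source is only an escape: the string holds a quote, not a backslash
has_backslash <- grepl("\\", x, fixed = TRUE)
print(has_backslash)
# [1] FALSE

# Same pattern as in the answer: remove ":" plus everything up to the next "|"
# (or the end of the string), and remove double quotes
cleaned <- gsub(":[^|]*|[\"]", "", x)
print(cleaned)
# [1] "keep one|keep two"
```

Since [^|]* can never consume a pipe, the separators stay in place without any extra handling.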

Upvotes: 1

lmo

Reputation: 38500

This method uses three passes of sub/gsub.

sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"

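The first (innermost) pass strips out the text from each ":" through the next "|" or the end of the string, inclusive, and replaces it with "|". The second pass removes backslashes and double quotes. The third pass removes the trailing "|" that the first pass leaves at the end of the string.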
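The three passes can also be written out one at a time to inspect the intermediate results. This is only a sketch: the input x and the names step1, step2 and step3 are hypothetical, while the patterns are the ones used in the nested call.

```r
# Hypothetical short input with the same structure as mytext
x <- "\"keep one\": drop this|\"keep two\": drop that too"

# Pass 1: replace ":" through the next "|" (or the end of the string) with "|"
step1 <- gsub(":.*?(\\||$)", "|", x)

# Pass 2: remove backslashes and double quotes
step2 <- gsub("[\\\\\"]", "", step1)

# Pass 3: remove the trailing "|" left behind by pass 1
step3 <- sub("\\|$", "", step2)

print(step3)
# [1] "keep one|keep two"
```

Splitting the call this way gives the same result as the nested version and makes it easier to check each pattern separately.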

Upvotes: 2
