Reputation: 1948
I need help with regex in R. I have a bunch of strings each of which has a structure similar to this one:
mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes. Thank you very much for completing this\": ME.' 'You!' sai"
Notice that this strings contains substrings within "" followed by a ":" and some text without quotation marks - until we encounter a "|" - then a new quotation mark appears etc.
Notice also that at the very end there is text after a ":" - but at the VERY end there is no "|"
My objective is to completely eliminate all text starting with any ":" (and INCLUDING ":") and until the next "|" (but "|" has to stay). I also need to eliminate all text that comes after the very last ":"
Finally (that's more of a bonus) - I want to get rid of all "\" characters and all quotation marks - because in the final solution I need to have "clean text": A bunch of strings separated only by "|" characters.
Is it possible?
Here is my awkward first attempt:
gsub('\\:.*?\\|', '', mytext)
Upvotes: 0
Views: 1189
Reputation: 17611
With a single gsub
you can match text after a :
(including the :
), so long as it doesn't contain a pipe: :[^|]*
. This matches the case at the end of the string, too. You can also match double quotes by searching for another pattern after the alternation character (|
): [\"]
gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
Upvotes: 1
Reputation: 38500
This method uses 3 passes of g?sub
.
sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
The first strips out the text in between ":" and "|" inclusive and replaces it with "|". The second pass removes "\" and """ and the third pass removes the "|" at the end.
Upvotes: 2