Regex to remove words that contain special characters, along with URLs, in R

Question

I want to remove words that contain special characters, except c# and c++. I would also like to remove any URL present in the sentence.

For example, my input is:

x <- "Google in the la#d of What c#  chell//oo grr+m my Website is: c++ http://www.url.com/bye"

What I am doing is:

gsub("http://(\\S+)|\\s*[\\^w\\s]\\s*[^c#c++\\s]","",x)

My expected output is

"Google in the of What c#  my Website c++"

But I am getting

"Google in the la#d of What c#  chell//oo grr+m my Webte i c++ "

Matthew Lundberg · Accepted Answer

Here is a single regex that, while horribly ugly, does the job:

gsub('(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)', '\\1', x, perl=TRUE)
## [1] "Google in the of What c# my Website c++"

This uses the character class [#/:+] to detect "special characters". Since c# and c++ contain such characters too, they are matched separately and kept.
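As a quick check of that character class, the sketch below applies it to a few tokens taken from the question's input. The names `tokens` and `has_special` are illustrative and not part of the original answer.

```r
# Tokens taken from the question's example sentence.
tokens <- c("Google", "la#d", "chell//oo", "grr+m", "is:", "c#", "c++")

# TRUE where the token contains at least one of the characters # / : +
has_special <- grepl("[#/:+]", tokens, perl = TRUE)
print(setNames(has_special, tokens))
```

Note that c# and c++ are flagged as well, which is why the full regex has to try them first and capture them before falling through to the generic "word with a special character" alternative.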

Breaking this down:

First, a space must precede the match (checked with a lookbehind, so it is not consumed) or the match must start at the beginning of the text: (?:^|(?<=\s)). The choice is written as a non-capturing group, (?:...). This is important because the only thing we want to capture is c# or c++ (later in the expression).
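To see why the leading space is only asserted rather than matched, here is a small sketch comparing the two approaches on a made-up string with two adjacent "special" words. The string `s` and the two pattern names are assumptions for illustration; both patterns are cut-down versions of the one above.

```r
s <- "keep a#b c:d keep"

# Variant that consumes the leading space. The space before "c:d" has already
# been used up as the trailing space of the previous match, so "c:d" is skipped.
consuming <- "(?:^|\\s)[^\\s]*[#/:+]+[^\\s]*(?:\\s|$)"

# Variant in the style of the answer: the leading space is asserted by a
# lookbehind and not consumed, so adjacent words are both matched.
lookbehind <- "(?:^|(?<=\\s))[^\\s]*[#/:+]+[^\\s]*(?:\\s|$)"

res_consuming  <- gsub(consuming,  "", s, perl = TRUE)
res_lookbehind <- gsub(lookbehind, "", s, perl = TRUE)
print(res_consuming)   # "keepc:d keep"
print(res_lookbehind)  # "keep keep"
```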

Next, a selection of three choices is given, with | as separators: (?:(c\+\+|c#)|http://[^\s]*|[^\s]*[#/:+]+[^\s]*). This choice is another non-capturing group.

The first alternative (one choice in the outer group, but with two possibilities inside it) matches c++ or c# and captures the value with (c\+\+|c#). Otherwise, a URL may be matched with http://[^\s]*, or a word containing a special character with [^\s]*[#/:+]+[^\s]*. The URL or the word with a special character is not captured.
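The effect of the capture can be made visible by wrapping the first group in angle brackets in the replacement. This is only a sketch: the leading anchor is simplified to ^ because each token is tested on its own, and the names `tokens`, `alternatives` and `captured` are illustrative.

```r
# One token per alternative, plus a second "keep" token.
tokens <- c("c# ", "c++ ", "http://www.url.com/bye", "grr+m ")

# Same three alternatives as in the answer, anchored at the start of each token.
alternatives <- "^(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)"

# "<\\1>" shows what group 1 captured; it is empty for the URL and for the
# generic special-character word.
captured <- sub(alternatives, "<\\1>", tokens, perl = TRUE)
print(captured)  # "<c#>" "<c++>" "<>" "<>"
```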

Finally, a space must be present or it must be the end of the string, as specified by the final non-capturing group: (?:\s|$).

Then the whole match is replaced by the first capture, which may be empty. If it is nonempty, the capture contains the string c# or c++.
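One case worth checking: the trailing whitespace is part of the match but not part of the capture, so a kept c# or c++ also loses the space that follows it. On the question's input this is harmless, but on other sentences it can glue words together. The `variant` pattern below is a hypothetical adjustment, not part of the original answer: it only asserts the space after c# or c++ with a lookahead, and trimws() tidies the end of the string.

```r
x <- "Google in the la#d of What c#  chell//oo grr+m my Website is: c++ http://www.url.com/bye"
y <- "I like c# a lot"  # made-up sentence where c# is followed by a normal word

# Pattern from the answer.
original <- "(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)"

# Hypothetical variant: c#/c++ must be followed by a space or the end of the
# string, but that space is left in place; removed words still take theirs.
variant <- "(?:^|(?<=\\s))(?:(c\\+\\+|c#)(?=\\s|$)|(?:http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$))"

y_original <- gsub(original, "\\1", y, perl = TRUE)
y_variant  <- gsub(variant,  "\\1", y, perl = TRUE)
x_variant  <- trimws(gsub(variant, "\\1", x, perl = TRUE))

print(y_original)  # "I like c#a lot"
print(y_variant)   # "I like c# a lot"
print(x_variant)   # "Google in the of What c#  my Website c++"
```

With this variant the double space after c# in the input is preserved, which happens to match the expected output given in the question.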

You do need perl=TRUE for this expression to be valid: the lookbehind (?<=\s) is PCRE syntax, which R's default regex engine does not support.
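A minimal sketch of what that means in practice, assuming the default engine rejects the lookbehind syntax (the exact error text may differ between R versions, so it is printed rather than assumed). The names `with_perl` and `without_perl` are illustrative.

```r
x <- "Google in the la#d of What c#  chell//oo grr+m my Website is: c++ http://www.url.com/bye"
pattern <- "(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)"

# perl = TRUE selects the PCRE engine, which understands (?<=...).
with_perl <- gsub(pattern, "\\1", x, perl = TRUE)

# Without perl = TRUE the default engine is expected to reject the pattern;
# tryCatch() returns the error message instead of stopping the script.
without_perl <- tryCatch(
  gsub(pattern, "\\1", x),
  error = function(e) conditionMessage(e)
)

print(with_perl)
print(without_perl)
```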
