Reputation: 458
I want to remove words that contain special characters, except for c# and c++. I would also like to remove any URL present in the sentence.
For example, my input is:
x <- "Google in the la#d of What c# chell//oo grr+m my Website is: c++ http://www.url.com/bye"
What I am doing is:
gsub("http://(\\S+)|\\s*[\\^w\\s]\\s*[^c#c++\\s]","",x)
My expected output is
"Google in the of What c# my Website c++"
But I am getting
"Google in the la#d of What c# chell//oo grr+m my Webte i c++ "
Upvotes: 1
Views: 1908
Reputation: 99331
How about this? It seems to do the trick. It seemed a bit easier to split up the string first with strsplit. One example below uses grep, and the other gsub. They each use a different regular expression. Also, the arguments to grep can be very useful at times.
> newX <-unlist(strsplit(x, "\\s"))
With grep:
> newX2 <- grep("((^[a-z]{2,3}$)|[A-Z]{1})|(c#|(\\+{2}))", newX, value = TRUE)
> paste(newX2, collapse = " ")
[1] "Google in the of What c# my Website c++"
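The grep arguments mentioned above can be illustrated with a small sketch. The token vector here is made up for the example and is not part of the original answer.

```r
# Hypothetical token vector, for illustration only
tokens <- c("in", "la#d", "c#", "grr+m", "c++")

# Default: integer indices of the elements that match
idx <- grep("[#+]", tokens)

# value = TRUE: return the matching elements themselves
hits <- grep("[#+]", tokens, value = TRUE)

# invert = TRUE: return the elements that do NOT match
clean <- grep("[#+]", tokens, value = TRUE, invert = TRUE)

# fixed = TRUE: treat the pattern as a literal string, so "++" needs no escaping
plus_plus <- grep("++", tokens, fixed = TRUE, value = TRUE)
```

Here idx holds positions 2 to 5, clean is just "in", and plus_plus is "c++".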
With gsub. This is actually much easier; the key idea is to determine the pattern of where the punctuation shows up within the characters.
> paste(gsub("[a-z]{2,3}(:|#)|(\\+|//)[a-z{1}]", "", newX), collapse = " ")
[1] "Google in the of What c# my Website c++"
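Both patterns above are tuned to the sample sentence. As a more general sketch of the same split-then-filter idea, one could keep a token only if it is purely alphanumeric or appears on an explicit whitelist. The names keep_exact and is_clean are introduced here for illustration, and the definition of a "clean" word (letters and digits only) is an assumption.

```r
x <- "Google in the la#d of What c# chell//oo grr+m my Website is: c++ http://www.url.com/bye"

# Split on runs of whitespace
tokens <- unlist(strsplit(x, "\\s+"))

# Whitelist: tokens that are kept even though they contain punctuation
keep_exact <- c("c#", "c++")

# Assumption: a token is "clean" if it consists only of letters and digits
is_clean <- grepl("^[[:alnum:]]+$", tokens)

result <- paste(tokens[is_clean | tokens %in% keep_exact], collapse = " ")
result
## [1] "Google in the of What c# my Website c++"
```

Because URLs always contain ":" and "/", they are dropped by the same rule and need no separate pattern.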
Upvotes: 3
Reputation: 42659
Here is a single regex that, while horribly ugly, does the job:
gsub('(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)', '\\1', x, perl=TRUE)
## [1] "Google in the of What c# my Website c++"
This uses the expression [#/:+] as the match for "special characters" other than those present in c# and c++.
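To see what that character class flags on its own, here is a short sketch using a few tokens from the question (the vector is assembled by hand for illustration):

```r
tokens <- c("la#d", "chell//oo", "grr+m", "is:", "Website", "c#", "c++")

# TRUE wherever the token contains at least one of # / : +
has_special <- grepl("[#/:+]", tokens)

# Only "Website" is free of these characters. Note that c# and c++ are
# flagged as well, which is why the full pattern has to test for them
# first and treat them as a separate, captured alternative.
tokens[!has_special]
## [1] "Website"
```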
Breaking this down:
First, a space must be present (but not actually matched) or it must be the beginning of the text for the match to begin: (?:^|(?<=\\s)). The choice is presented as a non-capturing group with (?:). This is important as we want to capture c# and c++ in the expression (later).
Next, a selection of three choices is given, with | as separators: (?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*). This choice is another non-capturing group.
The first two selections (actually one choice, but two possibilities for the match in the regex) match c++ or c# and capture the value with (c\\+\\+|c#). Otherwise, a URL may be matched with http://[^\\s]* or a word containing a special character with [^\\s]*[#/:+]+[^\\s]*. The URL or word with a special character is not captured.
Finally, a space must be present or it must be the end of the string, as specified by the final non-capturing group: (?:\\s|$).
Then the whole expression is replaced by the first capture, which may be empty. If it is nonempty, the capture will contain the string c# or c++.
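The pattern described above consumes the whitespace after each match. As a hedged variant (not the answerer's code), the same capture-or-drop idea can be written with lookarounds on both sides, so that no whitespace is consumed and two kept words can never be joined. The leftover spaces are then collapsed in a second step. The names pat, kept and out are illustrative.

```r
x <- "Google in the la#d of What c# chell//oo grr+m my Website is: c++ http://www.url.com/bye"

# (?<!\\S) and (?!\\S) require whitespace or a string edge on each side
# without consuming it. The URL needs no branch of its own here because
# it contains ":" and "/", which the character class already covers.
pat <- "(?<!\\S)(?:(c\\+\\+|c#)|\\S*[#/:+]\\S*)(?!\\S)"

# Replace each match by capture group 1: "c#" / "c++" survive,
# every other matched word becomes the empty string
kept <- gsub(pat, "\\1", x, perl = TRUE)

# Dropped words leave extra spaces behind; collapse and trim them
out <- trimws(gsub("\\s+", " ", kept))
out
## [1] "Google in the of What c# my Website c++"
```

The trade-off is a second gsub call for whitespace cleanup in exchange for a pattern that never touches the separators.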
You do need perl=TRUE for this expression to be valid.
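A small sketch of why perl=TRUE matters: lookbehind syntax such as (?<=\\s) is a PCRE feature, and R's default regex engine is expected to reject it with an error. The test string and variable names below are made up for illustration.

```r
# A pattern that uses a lookbehind, as the answer's expression does
uses_lookbehind <- "(?<=\\s)c#"

# Default engine: compiling the pattern should fail, so we catch the error
default_ok <- tryCatch({
  suppressWarnings(grepl(uses_lookbehind, "a c#"))
  TRUE
}, error = function(e) FALSE)

# PCRE engine: the lookbehind is understood and the pattern matches
perl_hit <- grepl(uses_lookbehind, "a c#", perl = TRUE)

c(default_ok = default_ok, perl_hit = perl_hit)
```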
Upvotes: 2