Reputation: 1935
I have some sentences like this one.
c = "In Acid-base reaction (page[4]), why does it create water and not H+?"
I want to remove all special characters except for these: ' ? & + - /
I know that if I want to remove all special characters, I can simply use
gsub("[[:punct:]]", "", c)
"In Acidbase reaction page4 why does it create water and not H"
However, some special characters such as + - ? are also removed, which I intend to keep.
I tried to create a string of special characters that I can use in some code like this
gsub("[special_string]", "", c)
The best I can do is to come up with this
cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")
However, the following code just won't work
gsub("[cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")]", "", c)
What should I do to remove special characters, except for a few that I want to keep?
Thanks
Upvotes: 19
Views: 77409
Reputation: 263301
To get your method to work, you need to put the literal "]" immediately after the leading "[":
gsub("[][!#$%()*,.:;<=>@^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"
You can then put the inner "[" anywhere. If you also needed to remove the minus sign, it would have to go last so that it is not read as a range. See the ?regex help page, in the text just after the list of predefined character classes.
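As a small illustration of the placement rules, here is a sketch using R's default regex engine. The variable names `no_brackets` and `no_brackets_or_minus` are made up for this example, and the character sets are deliberately shorter than the one above.

```r
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# "]" placed first in the set is a literal; "[" can then go anywhere.
no_brackets <- gsub("[][()]", "", x)
print(no_brackets)

# To remove "-" as well, put it last so it is not read as a range.
no_brackets_or_minus <- gsub("[][()-]", "", x)
print(no_brackets_or_minus)
```

The first call strips only the four bracket characters, so the comma and the hyphen survive; the second call also strips the hyphen.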
Upvotes: 7
Reputation: 52637
gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
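The pattern is a negated character class: it removes everything that is not a letter, digit, blank, or one of the listed characters. A variant sketch, assuming you also want to keep the apostrophe from the question's list; here the hyphen is placed last instead of being escaped, and `pattern`, `cleaned` and `cleaned2` are illustrative names only.

```r
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Negated class: drop anything that is NOT alphanumeric, a blank,
# or one of ' ? & + / - (the hyphen is last, so it is a literal).
pattern <- "[^[:alnum:][:blank:]'?&+/-]"

cleaned <- gsub(pattern, "", x)
print(cleaned)

# A second input that contains an apostrophe and an ampersand.
cleaned2 <- gsub(pattern, "", "Why doesn't A&B (x) work?")
print(cleaned2)
```

Listing what to keep is usually shorter than listing every punctuation character to remove, and it also strips characters you did not think of in advance.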
Upvotes: 29
Reputation: 109844
I think you're after a regex solution. I'll give you a messy regex solution and an add-on package solution (shameless self-promotion).
There's likely a better regex:
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")
## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\",
keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)
#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)
## [1] "In Acid-base reaction page why does it create water and not H+?"
Upvotes: 5