Reputation: 997
I am trying to use the regex feature in R to parse some tweet text into its key words. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
However, one of my sentences contains the sequence "\ud83d\udc4b". The parsing fails for this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried substituting the regex "\u+", but that did not match. What regex should I use to match this sequence? Thanks.
Upvotes: 1
Views: 126
Reputation: 110054
The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
# s is assumed to be the example string used elsewhere in this thread
s <- "\ud83d\udc4b Delta"
tolower(rm_non_ascii(s))
## [1] "delta"
Upvotes: 0
Reputation: 1638
> sentence = RemoveNotASCII(sentence)
The function below removes non-ASCII characters. Note that it expects a data frame and works through it column by column.
RemoveNotASCII <- function(x) {
  # Remove all non-ASCII characters from a data frame, column by column.
  # x: data frame
  z <- vector("list", ncol(x))
  for (j in seq_len(ncol(x))) {
    y <- as.character(x[, j])
    # Mark the bytes as latin1 so that iconv drops every non-ASCII byte
    Encoding(y) <- "latin1"
    z[[j]] <- iconv(y, "latin1", "ASCII", sub = "")
  }
  z <- do.call("cbind.data.frame", z)
  names(z) <- names(x)
  # Return the data frame with non-ASCII characters removed
  z
}
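Since the question works with a character vector rather than a data frame, a vector-oriented variant of the same idea may be easier to apply. This is only a sketch: the name remove_non_ascii_vec is hypothetical, and the exact handling of unconvertible bytes can differ between platform iconv implementations.

```r
# Hypothetical helper: strip non-ASCII bytes from a character vector.
# Reading the bytes as latin1 means every byte above 0x7F counts as a
# character that ASCII cannot represent, so iconv replaces it with sub = "".
remove_non_ascii_vec <- function(x) {
  x <- as.character(x)
  iconv(x, from = "latin1", to = "ASCII", sub = "")
}

# Example input: the waving-hand emoji (U+1F44B) followed by a word
sentence <- "\U0001F44B Delta"
sentence <- remove_non_ascii_vec(sentence)
```

The same helper could be applied to every column of a data frame with lapply, which would avoid the explicit loop.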
Upvotes: 0
Reputation: 174844
I think you want something like this,
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
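If a regex is preferred, as the question asks, one possible sketch is to delete every byte outside the printable ASCII range before running the other substitutions. The escape in the question most likely denotes a single non-ASCII character rather than literal backslash text, which would explain why a pattern aimed at "\u" finds nothing. The sample input below is an assumption, and note that this pattern also removes tabs and newlines.

```r
# Hypothetical sample input: the waving-hand emoji (U+1F44B), the character
# that the surrogate pair \ud83d\udc4b encodes, plus ordinary text.
sentence <- "\U0001F44B Delta 123!"

# Drop every byte outside printable ASCII (space through tilde).
# useBytes = TRUE matches byte by byte, so malformed UTF-8 should not
# raise an encoding error during matching.
sentence <- gsub("[^\\x20-\\x7E]", "", sentence, perl = TRUE, useBytes = TRUE)

# The cleaning steps from the question, unchanged
sentence <- gsub("[[:punct:]]", "", sentence)
sentence <- gsub("[[:cntrl:]]", "", sentence)
sentence <- gsub("\\d+", "", sentence)
sentence <- tolower(sentence)
```

Running the non-ASCII removal first means the later calls only ever see plain ASCII input.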
Upvotes: 4