user2327621

Reputation: 997

R regex matching for tweet pattern

I am trying to use regular expressions in R to parse tweet text into its keywords. I have the following code:

sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)

However, one of my sentences contains the sequence "\ud83d\udc4b". Parsing fails on this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried the regex "\u+", but that did not match. What regex should I use to match this sequence? Thanks.

Upvotes: 1

Views: 126

Answers (3)

Tyler Rinker

Reputation: 110054

The qdapRegex package has the rm_non_ascii function to handle this:

library(qdapRegex)
tolower(rm_non_ascii(s))

## [1] "delta"
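
For readers who prefer to stay in base R, a regex can do roughly the same job. The following is a minimal sketch, not qdapRegex's implementation: `strip_non_ascii` is a hypothetical helper name, and it assumes that dropping every byte outside printable ASCII (space through `~`) is acceptable, which also removes tabs and newlines.

```r
# Hypothetical helper (not part of qdapRegex): drop every byte outside
# the printable ASCII range 0x20-0x7E.
strip_non_ascii <- function(x) {
  # perl = TRUE lets the pattern use \xHH escapes; useBytes = TRUE matches
  # byte by byte, so multibyte input is not decoded before matching.
  gsub("[^\\x20-\\x7E]", "", x, perl = TRUE, useBytes = TRUE)
}

s <- "\U0001F44B Delta"      # U+1F44B is the code point behind "\ud83d\udc4b"
tolower(strip_non_ascii(s))  # " delta" (the leading space is kept)
```

Unlike the `rm_non_ascii` output shown above, this sketch does not trim the leading space; wrap the result in `trimws()` if that matters.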

Upvotes: 0

MasterJedi

Reputation: 1638

> sentence = RemoveNotASCII(sentence)

The function below removes non-ASCII characters. Note that it expects a data frame and processes it column by column.

RemoveNotASCII <- function#Remove all non ASCII characters
### remove column by columns non ASCII characters from a dataframe
(
  x ##<< dataframe
){
  n <- ncol(x)
  z <- list()
  for (j in 1:n) {
    y = as.character(x[,j])
    if (class(y)=="character") {
      Encoding(y) <- "latin1"
      y <- iconv(y, "latin1", "ASCII", sub="")
    }
    z[[j]] <- y
  }
  z = do.call("cbind.data.frame", z)
  names(z) <- names(x)
  return(z)
  ### Dataframe with non ASCII characters removed
}
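
The function above expects a data frame, whereas the question's `sentence` is presumably a character vector. A minimal sketch of the same `iconv` idea for a vector follows; `remove_non_ascii_vec` is a hypothetical name, and the input is assumed to be UTF-8 rather than the latin1 declared above.

```r
# Hypothetical vector variant: works on a character vector, not a data frame.
remove_non_ascii_vec <- function(x) {
  # sub = "" replaces each non-convertible byte with nothing
  iconv(as.character(x), from = "UTF-8", to = "ASCII", sub = "")
}

sentence <- c("\U0001F44B Delta", "plain text")
sentence <- remove_non_ascii_vec(sentence)
```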

Upvotes: 0

Avinash Raj

Reputation: 174844

I think you want something like this:

> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
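
To connect this with the cleanup steps from the question, the sketch below runs the ASCII conversion first, so the later `gsub` calls only see ASCII input. `clean_tweet` is a hypothetical name and the input is assumed to be UTF-8. The conversion names `from` and `to` explicitly, because in `iconv(s, "ASCII", sub="")` the second positional argument is `from`.

```r
# Hypothetical wrapper around the question's cleanup steps.
clean_tweet <- function(sentence) {
  # Step 1: drop anything that cannot be represented in ASCII.
  sentence <- iconv(sentence, from = "UTF-8", to = "ASCII", sub = "")
  # Steps 2-5: the original cleanup from the question, unchanged.
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  tolower(sentence)
}

clean_tweet("\U0001F44B Delta, 123!")
```

Surrounding whitespace is left in place; `trimws()` can be applied to the result if needed.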

Upvotes: 4
