screechOwl

Reputation: 28169

R getting rid of single quote character

I have a vector of character strings I'm trying to process, but I can't get rid of some weird characters.

When I read the csv file I used the following line:

train <- read.csv(file="files/file1.csv", header = T, encoding = "UTF-8")

I used this line to try and get rid of punctuation:

train$var1 <- gsub("[[:punct:]]", " ", train$var1)

However, when I inspect the result, I still see curly single quotes (‘), ellipsis characters (…), and black dots that look like a password-masking character. Here's the dput:

dput(unique(unlist(var1List))[c(30242:30246, 30561, 30484)])
c("opportunity…", "about…", "expected…", "reward…", "us…", "‘as", 
"<U+25CF>")

Any suggestions for getting rid of these characters?

Upvotes: 0

Views: 457

Answers (1)

Tim Pietzcker

Reputation: 336378

You could remove everything except a set of legal characters:

train$var1 <- gsub("[^\\w\\s]", " ", train$var1, perl = TRUE)

This replaces every character that is not a word character (a letter, digit, or underscore) or a whitespace character with a space.
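As a quick check, here is a minimal sketch applying that pattern to a few of the values from the question's dput. The strings are written with `\u` escapes so the example does not depend on how the file was read, the `"plain text"` entry and the variable names `x` and `cleaned` are made up for illustration, and the final whitespace tidy-up is an optional extra rather than part of the suggestion above.

```r
# Sample values based on the dput in the question, written with \u escapes:
# \u2026 is the ellipsis, \u2018 the left curly quote, \u25CF the black dot.
# "plain text" is a hypothetical extra entry to show clean input is untouched.
x <- c("opportunity\u2026", "\u2018as", "\u25CF", "plain text")

# Replace everything that is not a word character (\w) or whitespace (\s)
# with a space, using the PCRE engine.
cleaned <- gsub("[^\\w\\s]", " ", x, perl = TRUE)

# Optional tidy-up (an addition for illustration): collapse runs of
# whitespace into one space and trim leading/trailing whitespace.
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))

print(cleaned)
```

One caveat, as far as I know: with `perl = TRUE`, `\w` matches only ASCII letters, digits, and underscore by default, so accented letters would also be replaced. If you need to keep them, a Unicode property class such as `[^\\p{L}\\p{N}\\s]` is one possible alternative.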

Upvotes: 5
