Reputation: 28169
I have a vector of character strings I'm trying to process, but I can't get rid of some weird characters.
When I read the csv file I used the following line:
train <- read.csv(file="files/file1.csv", header = T, encoding = "UTF-8")
I used this line to try and get rid of punctuation:
train$var1 <- gsub("[[:punct:]]", " ", train$var1)
However, on inspecting the result, I still see curly single quotes (‘), ellipsis characters (…), and black dots like a password-masking character. Here's the dput:
dput(unique(unlist(var1List))[c(30242:30246, 30561, 30484)])
c("opportunity…", "about…", "expected…", "reward…", "us…", "‘as",
"<U+25CF>")
Any suggestions for getting rid of these characters?
Upvotes: 0
Views: 457
Reputation: 336378
You could remove everything except a set of legal characters:
train$var1 <- gsub("[^\\w\\s]", " ", train$var1, perl = TRUE)
would replace every character that is not a word character (letter, digit, or underscore) or a whitespace character with a space, for example.
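Here is a minimal sketch of that approach applied to a few of the values from the question. The vector `x` and the `cleaned` variable are made up for illustration, and the strings are written with `\u` escapes so the example does not depend on the file's encoding. Note that, as far as I know, `\w` only matches ASCII letters, digits, and the underscore with `perl = TRUE` unless the pattern starts with `(*UCP)`, so accented letters would also be replaced.

```r
# Hypothetical sample: some of the problem values from the question,
# written with \u escapes (ellipsis, left single quote, black circle)
x <- c("opportunity\u2026", "about\u2026", "\u2018as", "\u25CF")

# Replace everything that is not a word character or whitespace with a space
cleaned <- gsub("[^\\w\\s]", " ", x, perl = TRUE)

# Optional tidy-up: collapse runs of whitespace and trim the ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

If the underscore should go too, an explicit whitelist such as `"[^A-Za-z0-9\\s]"` is one alternative to `\w`.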
Upvotes: 5