Reputation: 44648
This is the error that I receive when I try to run tolower()
on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
It seems to be French company names that are the problem with the É
character. Although I have not investigated all of them (also not possible to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv()
, rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Upvotes: 19
Views: 37063
Reputation: 2183
library(tidyverse)
data_clean = data %>%
mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col
is the name of the new column I'm making out of the old uppercase one, which was called my_old_column
.
Upvotes: 5
Reputation: 105
I was getting the same error. However, in my case it wasn't when I was reading the file, but a bit later when processing it. I realised that I was getting the error, because the file wasn't read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
Upvotes: 8
Reputation: 408
My solution to this issue
library(dplyr) # pipes
library(stringi) # for stri_enc_isutf8
#Read in csv data
old_data<- read.csv("non_utf_data.csv", encoding = "UTF-8")
#despite specifying utf -8, the below columns are not utf8:
all(stri_enc_isutf8(old_data$problem_column))
#The below code uses regular expressions to cleanse. May need to tinker with the last
#portion that selects the grammar to retain
utf_eight_data<- old_data %>%
mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", old_data$problem_column)) %>%
rename(solved_problem = problem_column)
#this column is now utf 8.
all(stri_enc_isutf8(utf_eight_data$solved_problem))
Upvotes: 0
Reputation: 1301
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
lapply(old_data, enc2utf8),
stringsAsFactors = FALSE
)
Upvotes: 0
Reputation: 44648
Here's how I solved my problem:
First, I opened the raw data in a texteditor (Geany, in this case), clicked properties and identified the Encoding type.
After which I used the iconv()
function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every column of the data.frame
from the imported CSV. Important to note that I set stringsAsFactors=FALSE
in my read.csv()
call.
dat[,sapply(dat,is.character)] <- sapply(
dat[,sapply(dat,is.character)],
iconv,"WINDOWS-1252","UTF-8")
Upvotes: 24
Reputation: 41
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim()
from package stringr
to trim whitespace from start and end of string.
com$uppervar<-toupper(str_trim(com$var))
Upvotes: 4