Brandon Bertelsen

Reputation: 44648

Error in tolower() invalid multibyte string

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).

Error in tolower(m) : invalid multibyte string X

The problem seems to be French company names containing the É character, although I have not checked all of them (that is also not feasible to do manually).

It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.

Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?

Upvotes: 19

Views: 37063

Answers (6)

dmt

Reputation: 2183

library(tidyverse)

data_clean = data %>%
    mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))

Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.

Upvotes: 5

Onur Ece

Reputation: 105

I was getting the same error. In my case it did not appear while reading the file, but a bit later when processing it. I realised that I was getting the error because the file wasn't read with the correct encoding in the first place.

I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.

read.csv(<path>, encoding = "UTF-8")
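One caveat, offered as a hedged sketch and not as part of the answer above: in base R, `encoding` only declares how the strings that are read should be marked, while `fileEncoding` re-encodes the file as it is read. If the file is actually Windows-1252 and not UTF-8, something like the following may be closer to what is needed. The temporary file and its contents are made up for the illustration, and the snippet assumes the R session runs in a UTF-8 locale.

```r
# Hypothetical example file: a header line plus one name that starts with
# the single byte 0xC9, which is "É" in Windows-1252 / Latin-1.
tmp <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("name\n"), as.raw(0xC9), charToRaw("COLE INC\n"))
writeBin(bytes, tmp)

# fileEncoding converts from the stated encoding to the session's native
# encoding while reading (assumed here to be UTF-8).
dat <- read.csv(tmp, fileEncoding = "WINDOWS-1252", stringsAsFactors = FALSE)

# With valid strings, tolower() should no longer complain.
tolower(dat$name)
```

If the source encoding is unknown, it has to be identified first (for example in a text editor), since a wrong `fileEncoding` can silently produce wrong characters.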

Upvotes: 8

M_Merciless

Reputation: 408

My solution to this issue

library(dplyr) # pipes
library(stringi) # for stri_enc_isutf8

# Read in the CSV data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")

# Despite specifying UTF-8, the column below is not valid UTF-8:
all(stri_enc_isutf8(old_data$problem_column))

# The code below uses a regular expression to cleanse the column. You may need
# to tinker with the character class that selects which characters to retain.

utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
  rename(solved_problem = problem_column)

# This column is now UTF-8.

all(stri_enc_isutf8(utf_eight_data$solved_problem))

Upvotes: 0

Edgar Manukyan

Reputation: 1301

# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)

Upvotes: 0

Brandon Bertelsen

Reputation: 44648

Here's how I solved my problem:

First, I opened the raw data in a text editor (Geany, in this case), clicked Properties, and identified the file's encoding.

I then used the iconv() function to convert from that encoding to UTF-8:

x <- iconv(x,"WINDOWS-1252","UTF-8")

To be more specific, I did this for every character column of the data.frame from the imported CSV. Note that I set stringsAsFactors=FALSE in my read.csv() call.

# Convert every character column; lapply() keeps each column a vector
chr_cols <- sapply(dat, is.character)
dat[chr_cols] <- lapply(dat[chr_cols], iconv, from = "WINDOWS-1252", to = "UTF-8")
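For the "identify and convert" part of the question, here is a minimal sketch using base R's validUTF8(). The vector `x` is made up for the illustration, and the conversion assumes the offending entries really are Windows-1252, as they were in my file.

```r
# Hypothetical input: one plain ASCII name and one containing the raw
# Windows-1252 byte 0xC9 ("É"), which is not valid UTF-8 on its own.
x <- c("ACME LTD", "SOCI\xc9T\xc9 G\xc9N\xc9RALE")

# validUTF8() flags elements whose bytes do not form valid UTF-8
bad <- !validUTF8(x)
which(bad)

# Convert only the offending elements (assumption: they are Windows-1252).
# Passing sub = "" to iconv() would drop unconvertible bytes and avoid NA.
x[bad] <- iconv(x[bad], from = "WINDOWS-1252", to = "UTF-8")

# Every element is now valid UTF-8, so tolower() can process the vector
tolower(x)
```

Converting only the flagged elements leaves strings that are already valid untouched, which matters if the file mixes encodings.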

Upvotes: 24

user3866101

Reputation: 41

I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.

In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.

library(stringr) # for str_trim()
com$uppervar <- toupper(str_trim(com$var))

Upvotes: 4
