Reputation: 947
I am trying to import a csv that is in Japanese. This code:
url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)
returns the following error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>̏@(<8f>T<8e><9f><81>E<8e>w<92><e8><95>@<8a>փx<81>[<83>X<81>j'
I tried changing the encoding with Encoding(url) <- 'UTF-8' (and also to 'latin1'), and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?
Upvotes: 75
Views: 179954
Reputation: 594
I had this problem with a DBI connection while reading a SQL file with read_lines()
, but the file turned out to have nothing to do with it.
Refreshing the SQL connection (re-connecting) solved the issue.
I have no idea why it behaves this way.
Sys.info()
sysname release version machine
"Windows" "10 x64" "build 19044" "x86-64"
Upvotes: 0
Reputation: 31
Did you use copy-paste to create the CSV file? I had the same error and successfully tried the most popular solution from this thread (fileEncoding="latin1"). After I re-saved the data frame into a CSV file, I found that some cells had an extra space after the cell value (encoded as A-tilde, Ã). I removed these spaces in the original file and was then able to read it without fileEncoding="latin1" and without any error.
Upvotes: 0
Reputation: 153822
R's read.csv()
will puke on multi-byte characters wherever it is expecting a number.
I'm using Version: R version 4.2.1 (2022-06-23)
Put this data in file named: /tmp/foo.csv
#year,someval
2022,0.1389
2021,0.0000°
2020,0.2857
If you look closely you can see the 0.0000
value on the 2021 row has a 'degree' symbol (°) appended to it.
Load it this way using read.csv:
> read.csv('/tmp/foo.csv')
Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, :
invalid multibyte string at '<b0>0'
Calls: read.csv -> read.table -> type.convert -> type.convert.default
Execution halted
What does cat
have to say about that guff:
$ cat /tmp/foo.csv
#year,someval
2022,0.1389
2021,0.0000�
2020,0.2857
read.csv does not tolerate that degree symbol. Changing the encoding does nothing to help. You could try telling read.csv to interpret everything as a string (via colClasses), but then you've got string-to-number conversion issues downstream.
What does read.csv2 have to say?:
> read.csv2('/tmp/foo.csv')
X.year.someval
1 2022,0.1389
2 2021,0.000\xb0
3 2020,0.2857
https://www.codetable.net/hex/b0
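As a workaround, you can force every column in with colClasses = "character" so type.convert never runs. A minimal sketch, recreating the demo file from above (the path and the stray 0xb0 byte match that example):

```r
# Recreate the demo file, including the stray degree byte (0xb0),
# then read it with colClasses = "character" so no numeric conversion
# is attempted and the invalid multibyte string never reaches type.convert.
writeLines(c("#year,someval", "2022,0.1389", "2021,0.0000\xb0", "2020,0.2857"),
           "/tmp/foo.csv", useBytes = TRUE)
x <- read.csv("/tmp/foo.csv", colClasses = "character")
str(x)  # every column comes back as character
```

You then still have to strip the bad bytes and call as.numeric() yourself downstream, which is the conversion issue mentioned above.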
Upvotes: 0
Reputation: 511
I came across this error (invalid multibyte string
) recently, but my problem was a bit different:
We had forgotten to save a csv.gz file with its extension, and tried to use read_csv()
to read it. Adding the extension solved the problem.
Upvotes: 0
Reputation: 53
The simplest solution I found for this issue without losing any data or special characters (for example, with fileEncoding="latin1"
, characters like the Euro sign € will be lost) is to open the file first in a text editor like Sublime Text and use "Save with encoding - UTF-8".
Then R can import the file with no issue and no character loss.
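If you prefer to stay in R, the same re-encode-to-UTF-8 step can be sketched with base functions. This is a demo under assumptions: the file names are placeholders, and the input is assumed to be latin1.

```r
# Demo: create a latin1-encoded file, then re-encode it to UTF-8,
# mirroring the editor's "Save with encoding - UTF-8" step.
writeLines("2021,caf\xe9", "input.csv", useBytes = TRUE)  # raw latin1 bytes
lines <- readLines("input.csv", encoding = "latin1")      # mark input as latin1
out <- file("input_utf8.csv", open = "w", encoding = "UTF-8")
writeLines(enc2utf8(lines), out)                          # convert and write
close(out)
```

After this, read.csv("input_utf8.csv") should work without a fileEncoding argument.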
Upvotes: 4
Reputation: 52208
I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!
Upvotes: 0
Reputation: 1153
You may have encountered this issue because of an incompatible system locale.
Try setting the system locale with Sys.setlocale("LC_ALL", "C")
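A minimal self-contained sketch (the demo file and its stray latin1 byte are assumptions standing in for the problem file; restoring the previous locale afterwards may be approximate on some platforms):

```r
# Demo: a field with a stray latin1 byte (0xb0) breaks type.convert in a
# UTF-8 locale; switching LC_ALL to "C" lets read.csv get through it.
writeLines("2021,0.0000\xb0", "demo.csv", useBytes = TRUE)
old <- Sys.setlocale("LC_ALL", "C")
x <- read.csv("demo.csv", header = FALSE, stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL", old)  # attempt to restore the previous locale
```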
Upvotes: 18
Reputation: 176
The readr package from the tidyverse might help.
You can set the encoding via the locale argument of the read_csv()
function, using the locale()
function and its encoding argument:
read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "latin1"))
Upvotes: 15
Reputation: 98
If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV. That fixed this error for me when importing into R.
Upvotes: 0
Reputation: 49
I had a similar problem with scientific articles and found a good solution here: http://tm.r-forge.r-project.org/faq.html
By using the following line of code:
tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
you convert the multibyte strings into hex code. I hope this helps.
Upvotes: 0
Reputation: 14438
For those using Rattle
with this issue, here is how I solved it:
> library(rattle)   # if not done so already
> crv$csv.encoding = "latin1"
> rattle()
That worked for me; hopefully that helps a weary traveller.
Upvotes: 0
Reputation: 176638
Encoding
sets the declared encoding of a character string. It doesn't set the encoding of the file that the character string points to, which is what you want here.
This worked for me, after trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])   # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1], # convert to numbers
            function(d) type.convert(gsub(",", "", d))))
Upvotes: 118