iconv() returns NA when given a string with a specific special character

Question

I am trying to convert some strings from an input file from UTF-8 to ASCII. For most of the strings, the conversion with iconv() works fine. However, for some of them it returns NA. Manually fixing the file would be the simplest option, but unfortunately that is not available to me at the moment.

I have made a reproducible example of my problem. Assume that I have to find a way for iconv() to convert the string in s1 without returning NA.

Here is the reproducible example:

s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing

s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')

I get the following output:

> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"

After playing around with the code, I figured out that something is wrong with the entry "Besançon" in s1, which is copied exactly from the input file. When I type it in manually, the problem goes away. Since I can't modify the input file, what do you think the exact issue is, and do you have any idea how to solve it?

Thanks in advance,

Edit:

After closer inspection, there is something odd in the characters of the first line. It seems to be stripped by SO's formatting. The best I can do to reproduce it is two images. In the first image, my cursor is placed just before the #. The second image shows the result of pressing delete, which should delete the white space but instead deletes the ". So there is definitely something weird there.

LBes · Accepted Answer

It turns out that using sub='' solved the issue, although I am not sure why.

iconv(s1, to='ASCII//TRANSLIT', sub='')

From the documentation for the sub argument:

character string. If not NA it is used to replace any non-convertible bytes in the input. (This would normally be a single character, but can be more.) If "byte", the indication is "<xx>" with the hex code of the byte. If "Unicode" and converting from UTF-8, the Unicode point in the form "<U+xxxx>".
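The other sub values described in the documentation can make the non-convertible input visible instead of silently dropping it. Here is a minimal sketch, assuming the hidden character is a zero-width space (U+200B); that is only a stand-in, since the real character in the file is unknown. The //TRANSLIT suffix is left out on purpose so that every non-ASCII character is reported.

```r
# Hypothetical stand-in for the string read from the file. The trailing
# zero-width space (U+200B) is an assumption; the real character is unknown.
s1 <- "Besan\u00e7on\u200b"

# Without //TRANSLIT, every non-ASCII character is non-convertible, so each
# one is marked in the output instead of the whole result becoming NA.
as_bytes   <- iconv(s1, from = "UTF-8", to = "ASCII", sub = "byte")
as_unicode <- iconv(s1, from = "UTF-8", to = "ASCII", sub = "Unicode")

print(as_bytes)    # hex code of each offending byte
print(as_unicode)  # Unicode code point of each offending character
```

Once the offending character is known, it can be removed or replaced explicitly rather than relying on sub=''.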

So I eventually figured out that there was a character I couldn't convert (nor see) in the string and using sub was a way to eliminate it. I am still not sure what this character is though. But the problem is solved.
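To find out what an invisible character like this is, one option is to list the code points of the string directly. The sketch below uses the base R functions utf8ToInt() and charToRaw(). The example string with a trailing zero-width space is hypothetical, and the names cp, non_ascii and report are only for illustration.

```r
# Hypothetical stand-in for the problematic input; the zero-width space
# (U+200B) at the end is an assumption, not the confirmed culprit.
s1 <- "Besan\u00e7on\u200b"

# Decode the string into integer code points. utf8ToInt() returns NA
# if the input is not valid UTF-8.
cp <- utf8ToInt(s1)

# Positions and code points of everything outside the ASCII range.
non_ascii <- which(cp > 127L)
report <- data.frame(
  position  = non_ascii,
  codepoint = sprintf("U+%04X", cp[non_ascii]),
  stringsAsFactors = FALSE
)
print(report)

# The raw bytes are a useful fallback when the string is not valid UTF-8.
print(charToRaw(s1))
```

Here the report lists the ç at position 6 and the invisible character at position 9. On the real data, any code point in the report that does not correspond to a visible letter is a likely candidate for what makes iconv() return NA.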
