I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:
Warning: found non-ASCII strings
which is blocking it being allowed on CRAN.
There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.
You can grab the CSV data here. I'm reading it into R and resaving as rda with this code:
english_monarchs <- read.csv(
wherever_you_downloaded_the_file_to,
fileEncoding = "utf8",
na.strings = "",
stringsAsFactors = TRUE # R >= 4.0.0 no longer creates factors by default
)
save(english_monarchs, file = "english_monarchs.rda")
It's the name column of the dataset that contains non-ASCII values.
head(levels(english_monarchs$name))
## [1] "Adda" "Æðelbehrt"
## [3] "Æðelberht I" "Æðelberht II and Eardwulf"
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"
Based upon the (not very clear) guidance in the Encoding Issues section of Writing R Extensions, I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:
Encoding(levels(english_monarchs$name)) <- "utf8" # each encoding is still "unknown"
How can I make the data portable enough to be accepted on CRAN?
Upvotes: 29
Views: 3148
The thing that worked for me was to declare the encoding as "latin1", and then use iconv to convert to UTF-8.
Encoding(levels(english_monarchs$name)) <- "latin1"
levels(english_monarchs$name) <- iconv(
levels(english_monarchs$name),
"latin1",
"UTF-8"
)
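For anyone who wants to try the two steps without the dataset, here is a minimal self-contained sketch. It assumes a single string built from Latin-1 bytes as a stand-in for one factor level. The vector `x` and the helper `has_non_ascii` are made-up names for illustration, not part of the package or of base R.

```r
# Hypothetical stand-in for one factor level: the Latin-1 bytes for "Æðel".
# 0xC6 is "Æ" and 0xF0 is "ð" in Latin-1; rawToChar() leaves the string unmarked.
x <- rawToChar(as.raw(c(0xC6, 0xF0, 0x65, 0x6C)))
Encoding(x)                       # "unknown"

# Step 1: declare what the bytes really are.
Encoding(x) <- "latin1"

# Step 2: re-encode as UTF-8; iconv() marks the result as UTF-8 by default.
y <- iconv(x, "latin1", "UTF-8")
Encoding(y)                       # "UTF-8"
charToRaw(y)                      # c3 86 c3 b0 65 6c

# Illustrative helper (not a base R function): TRUE for strings that
# contain at least one byte outside the 7-bit ASCII range.
has_non_ascii <- function(s) {
  vapply(
    s,
    function(el) any(as.integer(charToRaw(el)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}
flags <- has_non_ascii(c("Adda", y))
flags                             # FALSE TRUE
```

Running something like `has_non_ascii(levels(english_monarchs$name))` before and after the conversion is one way to see which levels are affected. Whether the conversion is needed at all depends on how the strings were marked when the file was read.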
Upvotes: 16