antonio

Reputation: 11120

knitr: generating UTF-8 output from chunks

I have a doc.Rnw that is supposed to produce some Russian UTF-8 strings:

\documentclass{article}
\usepackage{inputenc}
\inputencoding{utf8}
\usepackage[main=english,russian]{babel}
\begin{document}
\selectlanguage {russian} 
<<test, results='asis', echo=FALSE>>=
print(readLines('string.rus', encoding="UTF-8"))

print("Здравствуйте")
@

Здравствуйте
\selectlanguage {english}
\end{document}

string.rus contains a UTF-8 string, which displays correctly in the R console:

print(readLines('string.rus', encoding="UTF-8"))    
# [1] "Здравствуйте"

doc.Rnw displays correctly in Windows Notepad, while both:

file.show("doc.Rnw")
file.show("doc.Rnw", encoding="UTF-8")

fail to display the UTF-8 strings properly.

Using:

knit("doc.Rnw")

The document part of the output doc.tex shows:

\begin{document}
\selectlanguage {russian} 
[1] "<U+0417><U+0434><U+0440><U+0430><U+0432><U+0441><U+0442><U+0432><U+0443><U+0439><U+0442><U+0435>"
[1] " <U+0097>д <U+0080>авс <U+0082>в <U+0083>й <U+0082>е"


Здравствуйте
\selectlanguage {english}
\end{document}

which of course does not compile with pdfLaTeX. Using:

knit("doc.Rnw", encoding="UTF-8")

gives even worse results.

Commenting out the chunk lines that should generate the UTF-8 strings:

print(readLines('string.rus', encoding="UTF-8"))     
print("Здравствуйте")

gives a valid doc.tex, which compiles in MiKTeX and properly shows the remaining UTF-8 string.
If I comment out only the first print... and leave the second one, I still can't compile. This seems to show that the original encoding of doc.Rnw is correct.
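The encoding of an input file can also be checked from R directly, rather than inferred from what compiles. The following is a minimal sketch, assuming R 3.3.0 or later (for validUTF8()); the temporary file is a hypothetical stand-in for string.rus and is written here only so the example is self-contained.

```r
# Hypothetical stand-in for string.rus: write three Cyrillic letters as raw UTF-8 bytes.
f <- tempfile(fileext = ".rus")
writeBin(charToRaw("\u0417\u0434\u0440"), f)

# Look at the bytes on disk: each of these letters takes two bytes in UTF-8.
bytes <- readBin(f, what = "raw", n = file.size(f))
print(bytes)

# Read the line, declaring the content to be UTF-8, and check it.
s <- readLines(f, encoding = "UTF-8", warn = FALSE)
print(validUTF8(s))   # TRUE when the bytes form valid UTF-8
print(Encoding(s))    # the encoding mark R attached to the string
```

If validUTF8() returns TRUE for the real file, the problem is more likely in how the output is written than in how the input is stored.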

I tried to replace both print commands with:

a="Здравствуйте"
Encoding(a)="UTF-8"
print(a)

In this case I can compile, but the PDF output is (the first string is cut off at the margin):

[1] «U+0417><U+0434><U+0440><U+0430><U+0432><U+0441><U+0442><U+0432><U+0443>
Здравствуйте

So the chunk output is still wrong.

How can I properly print UTF-8 strings from chunks?
R version is 3.3.3 (2017-03-06) for Windows and knitr is 1.15.1 (2016-11-22).

Upvotes: 1

Views: 610

Answers (1)

antonio

Reputation: 11120

An extended working example is below:

\documentclass{article}
\usepackage{inputenc}
\inputencoding{utf8}
\usepackage[main=english,russian]{babel}
\begin{document}
\selectlanguage {russian} 
<<test, results='asis', echo=FALSE>>=

s=readLines('string.rus', encoding="UTF-8")
message("s ", Encoding(s), ": ", s)
Encoding(s)="latin1"
message("s latin1: ", s)
Encoding(s)="unkwnown"
message("s unkwnown: ", s)
Encoding(s)="utf8"
message("s utf8: ", a)


a="Здравствуйте"
message("a ", Encoding(a), ": ", a)
Encoding(a)="latin1"
message("a latin1: ", a)
Encoding(a)="utf8"
message("a utf8: ", a)
Encoding(a)="UTF-8"
message("a UTF-8: ", a)

u=("\U0417")
message("u ", Encoding(u), ": ", u)
Encoding(u)="latin1"
message("u latin1: ", u)
Encoding(u)="unkwnown"
message("u unkwnown: ", u)

@

Здравствуйте
\selectlanguage {english}
\end{document}

After knit("doc.Rnw"), this is the output of the test chunk found in doc.tex (knitr code decoration removed for readability):

s UTF-8: <U+0417><U+0434><U+0440><U+0430><U+0432><U+0441><U+0442><U+0432><U+0443><U+0439><U+0442><U+0435>

s latin1: Здравствуйте

s unkwnown: Здравствуйте

s utf8: <U+0417><U+0434><U+0440><U+0430><U+0432><U+0441><U+0442><U+0432><U+0443><U+0439><U+0442><U+0435>

a unknown: Здравствуйте

a latin1: Здравствуйте

a utf8: Здравствуйте

a UTF-8: <U+0417><U+0434><U+0440><U+0430><U+0432><U+0441><U+0442><U+0432><U+0443><U+0439><U+0442><U+0435>

u UTF-8: <U+0417>

u latin1: З

u unkwnown: З

Some comments follow.

First, only message() works; print() always gives errors.

For both the externally read string s and the locally set string a, the behavior is weird.
In fact, keeping or explicitly setting the encoding to UTF-8 produces the wrong results (utf8 works for a).
One might think that the UTF-8 encoding of the documents (doc.Rnw and string.rus) is not properly set. This is why I added the line u=("\U0417"), which is UTF-8 for sure. Again, only removing the UTF-8 encoding gives proper output.
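One detail may help in reading the results above: Encoding<- recognizes only the labels "latin1", "UTF-8" and "bytes", and treats any other value as "unknown". Under that reading, "utf8" and "unkwnown" both just remove the encoding mark, which would be why they behave alike. The sketch below illustrates this on a single character; it only inspects the mark and the bytes and does not involve knitr.

```r
u <- "\u0417"                    # Cyrillic capital Ze, written as a Unicode escape
enc_before   <- Encoding(u)      # mark assigned by the parser for \u escapes
bytes_before <- charToRaw(u)

Encoding(u) <- "utf8"            # not a recognized label, so the mark is dropped
enc_after   <- Encoding(u)
bytes_after <- charToRaw(u)

print(c(before = enc_before, after = enc_after))
print(identical(bytes_before, bytes_after))  # the bytes themselves are not changed
```

Setting the mark never converts the string; it only changes how R interprets the same bytes when it prints or writes them.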

In a similar fashion, explicitly requesting UTF-8 output:

knit("doc.Rnw", encoding="UTF-8")

does not produce the UTF-8 characters, but their Unicode code points or garbled ones.

In the end, I can produce the desired .tex file and compile it with LaTeX, but why the above counter-intuitive behavior occurs is beyond me.
Hopefully someone will give a good explanation.
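The workaround that emerges from the output above can be wrapped in a small helper. This is only a sketch of what worked in my tests, not an explanation: drop_utf8_mark is a hypothetical name, and the assumption is that the string bytes are UTF-8 and that the unmarked bytes pass through to doc.tex unchanged, as they did above on Windows with R 3.3.3 and knitr 1.15.1.

```r
# Hypothetical helper: keep the UTF-8 bytes but remove the encoding mark.
drop_utf8_mark <- function(x) {
  x <- enc2utf8(x)           # make sure the bytes are UTF-8
  Encoding(x) <- "unknown"   # remove the mark; the bytes are left untouched
  x
}

s <- drop_utf8_mark("\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435")
message(s)                   # message() was the only output function that worked above
```

Inside a chunk, the same helper could be applied to the result of readLines() before calling message().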

Upvotes: 1
