Reputation: 259
I have several thousand Khmer-language .docx files and would like to convert them to .pdf format using Pandoc.
I installed Pandoc using MacPorts. Pandoc requires LaTeX for PDF conversion, so I installed MacTeX. The installation appears to have gone properly, and I've been able to convert English-language .docx files into .pdf without difficulty.
When I try to convert a Khmer-language file (you can find an example at https://briancroxall.net/pandoc/transcription.docx) to PDF, I use the following command:
pandoc transcription.docx -s -o transcript.pdf
I receive the following error:
Error producing PDF.
! Package inputenc Error: Unicode character អ (U+17A2)
(inputenc) not set up for use with LaTeX.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
l.64 ...�នៅសម័យប៉ុល ពត។}
Try running pandoc with --pdf-engine=xelatex.
Following this suggestion, I use this command:
pandoc --pdf-engine=xelatex transcription.docx -s -o transcript.pdf
Pandoc then emits a warning for every Khmer character in the text:
[WARNING] Missing character: There is no អ in font [lmroman10-bold]:mapping=tex-text;!
[WARNING] Missing character: There is no ្ in font [lmroman10-bold]:mapping=tex-text;!
[WARNING] Missing character: There is no ន in font [lmroman10-bold]:mapping=tex-text;!
...
A PDF is produced by this process (see https://briancroxall.net/pandoc/transcript.pdf), but it is largely empty.
As best I can tell, these warnings suggest that Khmer characters are not available in the font that the LaTeX engine is using for the conversion. Whether or not that is so, how can I convert these files successfully?
Upvotes: 10
Views: 7980
Reputation: 259
mb21's comment helped me figure this out. Since my system has a couple of Khmer fonts installed, I had to set mainfont to use one of them.
$ pandoc --pdf-engine=xelatex transcription.docx \
-V 'mainfont:Khmer MN' -s -o transcription.pdf
This produces a PDF with Khmer characters and no error messages.
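Since the goal is to convert several thousand files, the same command can be wrapped in a loop. This is only a rough sketch: convert_all, FONT, and the directory argument are names I made up for illustration, and it assumes every file should use the same Khmer font.

```shell
#!/bin/sh
# Sketch: convert every .docx in a directory to PDF with XeLaTeX and a Khmer font.
# convert_all and FONT are illustrative names, not Pandoc features.

FONT="${FONT:-Khmer MN}"   # assumed to be installed, as on my system

convert_all() {
    dir="${1:-.}"
    failed=0
    for src in "$dir"/*.docx; do
        [ -e "$src" ] || continue      # no matches: the glob stays literal, so skip it
        out="${src%.docx}.pdf"         # same base name, .pdf extension
        if ! pandoc --pdf-engine=xelatex "$src" \
                -V "mainfont:$FONT" -s -o "$out"; then
            echo "failed: $src" >&2    # report the file and keep going
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                # non-zero exit status if any file failed
}
```

Calling convert_all with the folder that holds the .docx files writes each PDF next to its source. A file that fails is reported on stderr and does not stop the rest of the batch.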
The PDF does have some issues: some Khmer phrases run past the right margin of the page. I think this is due to word segmentation. Khmer is written without spaces between words, which Word is equipped to deal with but which gets lost in the conversion to PDF.
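One thing that may help with the margin overflow, though I have not verified it on these files: XeTeX has a \XeTeXlinebreaklocale primitive that asks it to find line-break opportunities using locale-specific rules. The sketch below writes a small header file that turns this on for Khmer ("km"). The function name and the file name used with it are hypothetical.

```shell
# Sketch: write a LaTeX header that enables XeTeX's locale-aware line breaking.
# write_linebreak_header is an illustrative name, not part of Pandoc or XeTeX.
write_linebreak_header() {
    cat > "$1" <<'EOF'
\XeTeXlinebreaklocale "km"
\XeTeXlinebreakskip = 0pt plus 1pt
EOF
}
```

After running write_linebreak_header khmer-linebreak.tex, adding -H khmer-linebreak.tex (Pandoc's --include-in-header option) to the xelatex command above would insert these two lines into the LaTeX preamble. Whether the resulting breaks are good enough depends on the text.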
Upvotes: 11