user1424739

Reputation: 13735

How to convert txt to pdf with utf-8?

I use the following command to convert txt to ps. Then convert ps to pdf.

enscript --header='Page $% of $=' --word-wrap -o output.ps 2>/dev/null < input.txt

But it does not work for utf-8 input.

enscript --header='Page $% of $=' --word-wrap -o output.ps 2>/dev/null <<< ℃

The above command results in â\204\203 in the output file.

I see discussions saying that enscript does not support utf-8. There seems to be several alternatives that convert txt to pdf. But it is not clear which one is the most robust and convenient to use. Does anybody know a best solution to this problem?

Upvotes: 2

Views: 1778

Answers (3)

K J

Reputation: 11827

For Windows users, the simplest TXT-to-PDF route is to use native command-line printing.

Let us take the u2ps sample alph.txt

Latin alphabet
Кириллица
Ελληνικό αλφάβητο
ქართული დამწერლობა
อักษรไทย
カタカナ 漢字

If that is saved as RTF (Rich Text Format), then it can be converted directly, by printing, to a searchable PDF from the command-line shell.

write /pt alph.rtf "Microsoft Print to PDF" "Microsoft Print to PDF" alph.pdf


There is one limitation: the text must be saved as Rich Text, not plain text; otherwise it will not be printed as UTF-8.

Write is "WordPad.exe", and its hardcoded font selections cannot easily be changed, other than by saving the file as .RTF.

Upvotes: 0

Maxim

Reputation: 231

I am really fond of Alex Suykov's u2ps program for converting Unicode text to PostScript and PDF.

It was initially written in Perl; now there is also a C version.

Usage:

$ u2ps a.txt
$ ps2pdf a.ps a.pdf

Home page:

https://github.com/arsv/u2ps

Compiling the newer C version:

$ git clone https://github.com/arsv/u2ps
$ ./configure --prefix=/opt/u2ps-Alex-Suykov
$ make
$ sudo make install

This creates two executables:

/opt/u2ps-Alex-Suykov/bin/psfrem
/opt/u2ps-Alex-Suykov/bin/u2ps

where the first one is a font utility.

One needs both of them; hence

$ export PATH=$PATH:/opt/u2ps-Alex-Suykov/bin/

Warning: there is a different program with the same name, written by Yukihiro Nakai, at http://dev.man-online.org/man1/u2ps/

Upvotes: 0

KenS

Reputation: 31199

(Tackling this as a programming question, and not a request for software recommendation, which would be off-topic).

You can't use UTF-8, or at least not simply. PostScript does not support UTF-8 directly at all. However....

Since PostScript is a programming language, you could write a program which examines the first byte of the UTF-8 sequence to see whether it's a character code, or a code indicating further bytes. Essentially undoing the encoding to produce a Unicode code point.
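As an illustration of that first-byte examination, here is a minimal Python sketch of the same decoding logic a PostScript program would have to implement with bit operations (the function name is my own, and error handling is deliberately minimal):

```python
def utf8_decode_one(data, i=0):
    """Decode one UTF-8 sequence starting at data[i]; return (code_point, next_index)."""
    b0 = data[i]
    if b0 < 0x80:               # 0xxxxxxx: a plain ASCII byte, no continuation
        return b0, i + 1
    elif b0 >> 5 == 0b110:      # 110xxxxx: leading byte of a 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:     # 1110xxxx: leading byte of a 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:    # 11110xxx: leading byte of a 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid UTF-8 leading byte")
    for b in data[i + 1:i + n]:
        if b >> 6 != 0b10:      # continuation bytes must look like 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + n
```

Applied to the question's example, the three bytes E2 84 83 (which a Latin-1 viewer shows as â\204\203) decode to the single code point U+2103, ℃.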

From there, with a list of glyph names and Unicode code points, you could create a font with a custom Encoding, and instead of writing UTF-8 into the PostScript program, write the single byte which maps the character code through the Encoding to the relevant glyph name.
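A sketch of that re-encoding step, in Python for illustration only (the helper is hypothetical, not part of any real tool; the uniXXXX glyph names follow Adobe's glyph-naming convention, and the text's repertoire is assumed to fit in 256 glyphs):

```python
def build_encoding(text):
    """Assign each distinct code point in text a single byte and a glyph name."""
    # Collect unique code points in order of first appearance.
    cps = []
    for ch in text:
        if ord(ch) not in cps:
            cps.append(ord(ch))
    if len(cps) > 256:
        raise ValueError("repertoire exceeds a single-byte Encoding")
    # Map each code point to a byte, and each byte to an Adobe-style glyph name.
    byte_for = {cp: i for i, cp in enumerate(cps)}
    glyph_for = {i: "uni%04X" % cp for i, cp in enumerate(cps)}
    encoded = bytes(byte_for[ord(ch)] for ch in text)
    return encoded, glyph_for
```

The `glyph_for` table is what would populate the font's custom /Encoding array, and `encoded` is what would be written into the PostScript program in place of the UTF-8 bytes.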

Or you could define a CIDFont, and then create a CMap which maps the variable-length byte sequences of UTF-8 into CIDs to reference the correct glyph from the font. IIRC there are already such CMaps around; in fact Adobe makes a number of CMap files available, including UTF-16 and UTF-32 versions for various CJKV languages.

Be aware that while these approaches will produce PostScript which renders correctly, and which can then be used to create a PDF file that displays correctly, it will not be possible to copy from or search the resulting PDF file.

In order to search a PDF file, the font must have an associated ToUnicode CMap. This is a PDF-only construct; it does not exist in PostScript and there is no PostScript equivalent. So there's no way to embed that information in the PostScript program, which means it can't be embedded in the PDF file.
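For reference, a ToUnicode CMap on the PDF side is just a small text stream. Here is a hedged Python sketch that emits one for a single-byte encoding, following the beginbfchar form described in the PDF specification (the CMap name is made up for the example):

```python
def tounicode_cmap(byte_to_char):
    """Emit a minimal ToUnicode CMap mapping single-byte codes to UTF-16BE values."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin begincmap",
        "/CMapName /Custom-ToUnicode def",
        "/CMapType 2 def",
        "1 begincodespacerange <00> <FF> endcodespacerange",
        "%d beginbfchar" % len(byte_to_char),
    ]
    for code, ch in sorted(byte_to_char.items()):
        # Each entry pairs a character code with the UTF-16BE bytes of its text.
        lines.append("<%02X> <%s>" % (code, ch.encode("utf-16-be").hex().upper()))
    lines += [
        "endbfchar", "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end end",
    ]
    return "\n".join(lines)
```

As the answer notes, this stream only has a home in a PDF file (attached to the font via the /ToUnicode key); there is no slot for it in a PostScript program.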

Upvotes: 1
