SyntaxT3rr0r

Reputation: 28293

Emacs hexl-mode UTF8 BOM issue

I ran into something a bit weird with hexl-mode under Emacs (GNU Emacs 22.2.1 / Debian GNU Linux).

I had a UTF-8 text file to which I wanted to prepend a BOM (Byte Order Mark; even though adding a pointless BOM to a UTF-8 file is not recommended, the spec clearly states that a BOM in a UTF-8 file is legal).

Here's how the file is seen by the file command:

...$  file  /tmp/test.txt
/tmp/test.txt: UTF-8 Unicode English text

The following works:

open the UTF8 file (without BOM) in text mode
add three ASCII characters at the beginning of the file
close the file   (<-- see, very important, I need to close the file)
M-x hexl-mode
M-x hexl-find-file  (re-opening the file but this time in hexl-mode)
M-x hexl-insert-hex-string
EFBBBF
C-x C-s (saving the file)
M-x hexl-mode-exit

I then get a UTF-8 file with a BOM, as shown here by the file command:

...$  file  /tmp/test.txt
/tmp/test.txt: UTF-8 Unicode (with BOM) English text

(note that the file command heuristically detects this as UTF-8 with BOM "English text", but the file does contain a lot of Euro symbols: my point is that, before adding the BOM, it is NOT an ASCII file but already a UTF-8 file, as shown above)
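For reference, the BOM check and the prepend step can also be done outside Emacs by working on raw bytes only. The following is a minimal Python sketch, not part of the procedure above; the function names has_utf8_bom and prepend_utf8_bom are hypothetical, and the only library constant used is codecs.BOM_UTF8 (the bytes EF BB BF).

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:  # binary mode: no character decoding happens
        return f.read(3) == codecs.BOM_UTF8


def prepend_utf8_bom(path):
    """Rewrite `path` with a UTF-8 BOM in front, unless one is already there."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + data)
```

Because both functions open the file in binary mode, the Euro symbols already in the file are passed through untouched.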

However, I simply cannot open the file under Emacs first, then call hexl-mode, then try to replace the first three characters with 0xEF 0xBB 0xBF (the BOM), and then save.

Apparently there are crazy conversion issues taking place when switching from (Text) to (Hexl) mode.

Am I missing something obvious, or is converting between Text and Hexl modes a bit broken, so that I'm better off switching to hexl-mode first, doing my hex editing, then saving and closing the file and re-opening it in text mode?

Upvotes: 3

Views: 1988

Answers (2)

js2010

Reputation: 27428

Note that an xml file with this tag will be silently converted to utf-16 big endian on saving.

<?xml version="1.0" encoding="UTF-16"?>

This would automatically make the file utf8 with bom after changing and saving it:

<?xml version="1.0" encoding="UTF-8"?>

Upvotes: 0

Oleg Pavliv

Reputation: 21162

If you take a look at the code of hexl-find-file, you will see that it calls find-file-literally and then switches to hexl-mode.

From the documentation of find-file-literally:

Visit file FILENAME with no conversion of any kind. Format conversion and character code conversion are both disabled, and multibyte characters are disabled in the resulting buffer.

So you may open your file with find-file-literally, add three characters, and then switch to hexl-mode.
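Why the literal (byte) view matters can be illustrated outside Emacs. The Python sketch below contrasts overwriting three bytes with overwriting three decoded characters that are then re-encoded as UTF-8 on save. It is only an analogy: whether Emacs 22 performs exactly this conversion when switching from text mode to hexl-mode is an assumption, and patch_bytes and patch_chars are hypothetical names.

```python
BOM = b"\xef\xbb\xbf"


def patch_bytes(raw):
    """Overwrite the first three bytes, as in a literal (unibyte) buffer."""
    return BOM + raw[3:]


def patch_chars(text):
    """Overwrite the first three characters with U+00EF U+00BB U+00BF,
    then encode the result as UTF-8, as a decoded text buffer would on save."""
    return ("\u00ef\u00bb\u00bf" + text[3:]).encode("utf-8")


original = "abc\u20ac"  # three ASCII placeholders followed by a Euro sign

print(patch_bytes(original.encode("utf-8")).hex())  # efbbbfe282ac
print(patch_chars(original).hex())                  # c3afc2bbc2bfe282ac
```

In the byte view the file starts with EF BB BF, a valid BOM. In the character view each of the three code points is encoded as two bytes, so the file starts with C3 AF C2 BB C2 BF, which is not a BOM.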

Upvotes: 3
