user797257

Translate Unicode (UTF-8) codepoint to bytes

I went as far as searching C sources, but I can't find this function, and I really don't want to write one myself because it absolutely must be there.

To elaborate: Unicode code points are written as U+XXXX, and getting that number is easy. What I need is the form in which the character is written to a file, for example. In UTF-8, a code point up to U+007F is stored as a single byte holding its 7 bits; a larger code point is split across several bytes, with the lead byte carrying the high-order bits and each following byte carrying the next 6 bits. Emacs certainly knows how to do this, but I can't find a way to get the UTF-8 encoding of a string from it as a sequence of 8-bit bytes.

Functions such as get-byte or multibyte-char-to-unibyte work only with characters that can be represented in no more than 8 bits. I need what get-byte does, but for multibyte characters, so that instead of an integer in 0..255 I'd receive either a vector of integers in 0..255 or a single long integer below 2^32.

EDIT

Just in case anyone needs this later:

(defun haxe-string-to-x-string (s)
  "Return S with every byte of its UTF-8 encoding written as a \\xNN escape."
  ;; Encode the whole string first instead of testing each character with
  ;; `multibyte-char-to-unibyte': in recent Emacs versions that function
  ;; returns characters below 256 unchanged, so U+0080..U+00FF would be
  ;; printed as one byte rather than as their two UTF-8 bytes.
  (mapconcat (lambda (byte) (format "\\x%02x" byte))
             (encode-coding-string s 'utf-8)
             ""))

Upvotes: 6

Views: 1179

Answers (1)

legoscia

Reputation: 41648

encode-coding-string might be what you're looking for:

*** Welcome to IELM ***  Type (describe-mode) for help.
ELISP> (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8)
"e\304\245o\305\235an\304\235o \304\211iu\304\265a\305\255de"

It returns a string, but you can access the individual bytes with aref:

ELISP> (aref (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8) 1)
196
ELISP> (format "%o" 196)
"304"
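
To see where 196 comes from, here is a minimal sketch that builds the two UTF-8 bytes of ĥ (U+0125) by hand. The helper name my-utf8-two-bytes is made up for illustration and is not part of Emacs; it only covers code points in U+0080..U+07FF, where UTF-8 uses the two-byte pattern 110xxxxx 10xxxxxx.

```elisp
;; Hypothetical helper (not an Emacs built-in): hand-rolled UTF-8 for the
;; two-byte range only, to show how the bits are laid out.
(defun my-utf8-two-bytes (code-point)
  "Return a list of the two UTF-8 bytes for CODE-POINT (U+0080..U+07FF)."
  (list (logior #xC0 (ash code-point -6))         ; 110xxxxx: the high 5 bits
        (logior #x80 (logand code-point #x3F))))  ; 10xxxxxx: the low 6 bits

;; #x125 is the code point of ĥ.
(my-utf8-two-bytes #x125)   ; => (196 165)
```

The result matches the bytes at positions 1 and 2 of the encoded string above. For anything outside this range, encode-coding-string remains the tool to use.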

or, if you don't mind using cl functions, concatenate is your friend (it is called cl-concatenate in the newer cl-lib):

ELISP> (concatenate 'list (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8))
(101 196 165 111 197 157 97 110 196 157 111 32 196 137 105 117 196 181 97 197 173 100 101)
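
If you'd rather avoid cl altogether, the built-in vconcat and append also accept a string. The following is a sketch of the two return shapes asked for in the question, a vector of bytes and a single integer; the names my-utf8-bytes and my-utf8-integer are hypothetical, and the packing order (first byte in the highest position) is an assumption, since the question doesn't specify one.

```elisp
;; Hypothetical helpers (not Emacs built-ins), using only core functions.
(defun my-utf8-bytes (string)
  "Return the UTF-8 encoding of STRING as a vector of integers 0..255."
  (vconcat (encode-coding-string string 'utf-8)))

(defun my-utf8-integer (char)
  "Return the UTF-8 bytes of CHAR packed into one integer, first byte highest."
  (let ((result 0))
    ;; (append STRING nil) turns the encoded unibyte string into a list of bytes.
    (dolist (byte (append (encode-coding-string (string char) 'utf-8) nil))
      (setq result (logior (ash result 8) byte)))
    result))

(my-utf8-bytes (string #x125))   ; ĥ => [196 165]
(my-utf8-integer #x125)          ; => 50341, i.e. #xC4A5
```

Since a UTF-8 sequence for a Unicode character is at most 4 bytes long, the packed integer stays below 2^32.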

Upvotes: 5
