Neoasimov

Reputation: 1111

Convert a UTF-32 encoded string (C style) into a UTF-16 encoded one (JSON style) in Java/Clojure

I am receiving a string from a service that apparently encodes its Unicode characters as UTF-32 escapes such as \U0001B000 (C-style Unicode escapes). To serialize this information as JSON, however, I have to encode them as UTF-16 escapes such as \uD82C\uDC00.

I have no idea how to read such an encoded string in Java/Clojure, or how to produce output in the other format.

Upvotes: 3

Views: 845

Answers (2)

Symfrog

Reputation: 3418

You can read the received bytes from the service using:

(slurp received-bytes :encoding "UTF-32")

and write a string using:

(spit destination string-to-encode :encoding "UTF-16")
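For the Java side of the question, the same byte-level round trip might look like the sketch below. The class and method names are made up for illustration, and it assumes the JVM provides a "UTF-32" charset (common JDKs do, but it is not one of the charsets the platform guarantees). It re-encodes as UTF-16BE so that no byte-order mark is emitted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Utf32Transcode {
    // Decode UTF-32 bytes into a Java String, then re-encode as UTF-16BE (no BOM).
    static byte[] utf32ToUtf16(byte[] utf32Bytes) {
        // Assumption: the "UTF-32" charset is available on this JVM.
        String text = new String(utf32Bytes, Charset.forName("UTF-32"));
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] in = {0x00, 0x01, (byte) 0xB0, 0x00}; // U+1B000, big-endian UTF-32
        for (byte b : utf32ToUtf16(in)) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints: D8 2C DC 00
    }
}
```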

If you mean that you have a string containing the textual escape sequence itself (a backslash, a U, and eight hex digits), then you can convert it to the actual character using:

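(defn utf32->str [utf32-str]
  (let [buf (java.nio.ByteBuffer/allocate 4)]
    ;; skip the leading "\U", parse the 8 hex digits, store them as 4 big-endian bytes
    (.putInt buf (Integer/parseInt (subs utf32-str 2) 16))
    ;; decode those 4 bytes as a single UTF-32 character
    (String. (.array buf) "UTF-32")))

(utf32->str "\\U0001B000")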

and then convert it to UTF-16 using:

(defn str->utf16 [s]
  (let [byte->str #(format "%02x" %)]
    (apply str
           ;; drop the first pair, which is the byte-order mark (FE FF);
           ;; JSON requires a lowercase \u prefix
           (drop 1 (map #(str "\\u" (byte->str (first %)) (byte->str (second %)))
                        (partition 2 (.getBytes s "UTF-16")))))))

Here is a sample run:

(str->utf16 (utf32->str "\\U0001B000"))
;=> "\\ud82c\\udc00"
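If it helps to see what the UTF-16 encoder is doing for code points above U+FFFF, the surrogate-pair arithmetic can be written out by hand. This is only a sketch in Java; the class and method names are hypothetical, and it emits uppercase hex digits, which JSON accepts just like lowercase ones.

```java
// Hypothetical helper that spells out the surrogate-pair arithmetic.
class SurrogateMath {
    // Format one Unicode code point as one or two JSON-style escapes.
    static String toJsonEscape(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            // Basic Multilingual Plane: a single 16-bit unit is enough.
            return String.format("\\u%04X", codePoint);
        }
        int v = codePoint - 0x10000;      // 20-bit offset into the supplementary planes
        int high = 0xD800 + (v >> 10);    // top 10 bits -> high surrogate
        int low = 0xDC00 + (v & 0x3FF);   // bottom 10 bits -> low surrogate
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        // 0x1B000 - 0x10000 = 0xB000; 0xB000 >> 10 = 0x2C; 0xB000 & 0x3FF = 0
        System.out.println(toJsonEscape(0x1B000));
    }
}
```

For 0x1B000 this gives the high surrogate D82C and the low surrogate DC00, matching the pair asked for in the question.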

Upvotes: 2

xsc

Reputation: 6073

Once you have the string you want to replace, the following function will do it:

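;; The UTF-16 byte array starts with a 2-byte byte-order mark, skipped here;
;; a b c d are the four bytes of the surrogate pair.
(defn escape-utf16
  [[_ _ a b c d]]
  (format "\\u%02X%02X\\u%02X%02X" a b c d))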

(defn replace-utf32
  [^String s]
  (let [n (Integer/parseInt (subs s 2) 16)]
    (-> (->> (map #(bit-shift-right n %) [24 16 8 0])
             (map #(bit-and % 0xFF))
             (byte-array))
        (String. "UTF-32")
        (.getBytes "UTF-16")
        (escape-utf16))))

(replace-utf32 "\\U0001B000")
;; => "\\uD82C\\uDC00"

And, for targeted replacement, use a regex:

(require '[clojure.string :as string])
(string/replace
   "this is a text \\U0001B000."
   #"\\U[0-9A-F]{8}"
   replace-utf32)
;; => "this is a text \\uD82C\\uDC00."

Disclaimer: I haven't given a single thought to edge- (or any other than the provided) cases. But I'm sure you can use this as a base for further exploration.
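As one possible direction for that exploration, here is a rough Java sketch that also copes with code points below U+10000 (which need only a single escape) and leaves out-of-range values untouched. The class name and the choice to pass invalid escapes through unchanged are my own assumptions, not something the service or JSON mandates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites C-style 8-digit escapes as JSON-style 4-digit escapes.
class CEscapeToJson {
    // A literal backslash, an uppercase U, then exactly eight hex digits.
    private static final Pattern C_ESCAPE = Pattern.compile("\\\\U([0-9A-Fa-f]{8})");

    static String convert(String text) {
        Matcher m = C_ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse as long so that values such as FFFFFFFF do not overflow an int.
            long value = Long.parseLong(m.group(1), 16);
            String replacement;
            if (value > Character.MAX_CODE_POINT) {
                replacement = m.group(); // assumption: leave invalid escapes as they are
            } else {
                StringBuilder esc = new StringBuilder();
                // toChars yields one char for BMP code points, a surrogate pair otherwise.
                for (char c : Character.toChars((int) value)) {
                    esc.append(String.format("\\u%04X", (int) c));
                }
                replacement = esc.toString();
            }
            // quoteReplacement keeps the backslashes from being read as regex escapes.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("this is a text \\U0001B000."));
    }
}
```

Character.toChars does the surrogate split itself, so no manual bit shifting is needed here.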

Upvotes: 1
