Read a charlist in a non-UTF-8 encoding

Question

Assume we get a charlist from a foreign source, and it represents a string in some legacy single-byte encoding such as ISO-8859-2. There is a Codepagex package that simplifies conversions between encodings, but its to_string function expects a binary as input.

All the standard library functions effectively assume Latin1 (a.k.a. ISO-8859-1) input when converting to UTF-8 (to_string, IO.chardata_to_string, "#{}", etc.).

What I came up with is:

input
  |> to_string()
  |> Codepagex.from_string!(:iso_8859_1)
  |> Codepagex.to_string!(:iso_8859_2) # target encoding

which is a bit ugly.
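
For reference, the first two steps of that pipeline can be reproduced with the standard library alone, which shows that they only recover the raw bytes. This is a minimal sketch: the sample input is hypothetical, and :unicode.characters_to_binary/3 stands in for Codepagex.from_string!(:iso_8859_1) on the assumption that both encode codepoints below 256 as single bytes.

```elixir
# Hypothetical sample: two byte values from a legacy 1-byte encoding.
input = [224, 225]

# to_string/1 reads each integer as a Unicode codepoint and emits UTF-8.
utf8 = to_string(input)

# Re-encoding that UTF-8 as Latin-1 maps each codepoint below 256 back to one byte.
bytes = :unicode.characters_to_binary(utf8, :utf8, :latin1)

# The round trip yields the same binary as a direct byte-wise conversion.
IO.inspect(bytes == :erlang.list_to_binary(input))
IO.inspect(bytes, binaries: :as_binaries)
```

The round trip therefore only works because codepoints 0 to 255 coincide with Latin-1 byte values.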

Is there a robust, handy, built-in or idiomatic Elixir way to get a string out of a charlist in a known encoding?

Dogbert · Accepted Answer

to_string on a list of integers in Elixir treats the integers as Unicode codepoints (to_string [960] #=> "π"), whereas you want to treat each integer as a byte. In Erlang, this can be done using list_to_binary. I couldn't find a wrapper for this in Elixir's built-in modules, but you can always call :erlang.list_to_binary:

iex(1)> [224] |> :erlang.list_to_binary
<<224>>
iex(2)> inspect([224] |> to_string, binaries: :as_binaries)
"<<195, 160>>"
iex(3)> [224] |> :erlang.list_to_binary |> Codepagex.to_string!(:iso_8859_1)
"à"
iex(4)> [224] |> :erlang.list_to_binary |> Codepagex.to_string!(:iso_8859_2)
"ŕ"
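
If the input may contain values that are not bytes, the conversion can be wrapped in a small guard. The following is only a sketch: LegacyCharlist and to_bytes/1 are hypothetical names, not part of Elixir or Codepagex. It also uses IO.iodata_to_binary/1, a standard-library function that packs integers as bytes in the same way.

```elixir
defmodule LegacyCharlist do
  # Hypothetical helper module, introduced here for illustration only.

  @doc "Packs a flat list of byte values (0..255) into a binary, one byte per integer."
  def to_bytes(charlist) when is_list(charlist) do
    if Enum.all?(charlist, &(is_integer(&1) and &1 in 0..255)) do
      {:ok, :erlang.list_to_binary(charlist)}
    else
      # Values such as 960 are codepoints, not bytes, so they are rejected here
      # instead of letting :erlang.list_to_binary/1 raise ArgumentError.
      {:error, :not_a_byte_list}
    end
  end
end

{:ok, bytes} = LegacyCharlist.to_bytes([224])
IO.inspect(bytes, binaries: :as_binaries)

# IO.iodata_to_binary/1 produces the same bytes for this input.
IO.inspect(IO.iodata_to_binary([224]) == bytes)

IO.inspect(LegacyCharlist.to_bytes([960]))
```

The resulting binary can then be passed to Codepagex.to_string!/2 with the source encoding, as in the session above.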
