Bad character code when parsing xml in exml

Question

I have a simple Elixir application:

defmodule Interop do
  @moduledoc false
  def main(args) do
    Application.ensure_all_started :inets
    resp = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
    handle_response(resp)
  end

  defp handle_response({:ok, resp}) do
    {{_, _, _}, _headers, body} = resp
    doc = Exml.parse ~s/#{body}/
  end

  defp handle_response({:error, resp}) do
    IO.puts resp
  end
end

When I run it, I get:

** (exit) {:bad_character_code, [60, 116, 97, 98, 101, 108, 97, 95, 107, 117, 114, 115, 111, 119, 32, 116, 121, 112, 61, 34, 65, 34, 32, 117, 105, 100, 61, 34, 49, 54, 97, 49, 56, 54, 34, 62, 13, 10, 32, 32, 32, 60, 110, 117, 109, 101, 114, 95, ...], :"iso-8859-2"}
    xmerl_ucs.erl:511: :xmerl_ucs.to_unicode/2
    xmerl_scan.erl:709: :xmerl_scan.scan_prolog/4
    xmerl_scan.erl:565: :xmerl_scan.scan_document/2
    xmerl_scan.erl:288: :xmerl_scan.string/2
    lib/exml.ex:11: Exml.parse/2
    (elixir) lib/kernel/cli.ex:76: anonymous fn/3 in Kernel.CLI.exec_fun/2

When I download the file manually and try to parse it, I get the same error.

My question is: where is the mistake in my code? I suspect, though it is only a feeling, that the problem lies in doc = Exml.parse ~s/#{body}/ and the encoding of the document. Any suggestions?

Dogbert · Accepted Answer

The problem is that ~s/#{body}/ actually changes the encoding of body from ISO-8859-2 to UTF-8, because interpolation treats a list of integers as a list of Unicode codepoints and encodes each one as UTF-8:

iex(1)> ~s/#{[241]}/ <> <<0>>
<<195, 177, 0>>

while the XML file explicitly says that it's encoded as ISO-8859-2:

$ curl -s http://www.nbp.pl/kursy/xml/LastA.xml | head -1
<?xml version="1.0" encoding="ISO-8859-2"?>

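The re-encoding described above can be reproduced without any network access. The following is a minimal sketch, not part of the original answer: the variable names are made up for illustration, and the single byte 241 stands in for the response body.

```elixir
# A one-element charlist standing in for the body returned by :httpc.
latin_bytes = [241]

# Interpolation goes through List.to_string/1, which reads each integer as a
# Unicode code point and emits its UTF-8 encoding (two bytes for 241).
interpolated = "#{latin_bytes}"

# :erlang.list_to_binary/1 copies the integers as raw bytes instead.
raw = :erlang.list_to_binary(latin_bytes)

IO.inspect(interpolated, binaries: :as_binaries)  # <<195, 177>>
IO.inspect(raw, binaries: :as_binaries)           # <<241>>
IO.inspect(String.valid?(raw))                    # false: a lone 241 is not valid UTF-8
```

So after interpolation the data is valid UTF-8, while the XML declaration inside it still claims ISO-8859-2.
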
Your code works if you force the XML parser to use UTF-8 encoding:

iex(1)> {:ok, {{_, _, _}, _headers, body}} = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
{:ok,
 {{'HTTP/1.1', 200, 'OK'},
  [...],
  [60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46,
   48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34, 73, 83, 79, 45,
   56, 56, 53, 57, 45, 50, 34, 63, 62, 13, 10, 60, 116, ...]}}
iex(2)> "#{body}" |> Exml.parse(encoding: :"utf-8")
{:xmlElement, :tabela_kursow, :tabela_kursow, [], {:xmlNamespace, [], []}, [],
 1,
 [{:xmlAttribute, :typ, [], [], [], [tabela_kursow: 1], 1, [], 'A', false},
  {:xmlAttribute, :uid, [], [], [], [tabela_kursow: 1], 2, [], '16a186',
   false}],
 ...}
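
One caveat, offered as a side note rather than as part of the original answer: interpolation maps each byte to the Unicode code point with the same number, which matches ISO-8859-1 rather than ISO-8859-2, so Polish letters such as ł (byte 0xB3) would likely come out as different characters. A minimal sketch of a stricter conversion follows, assuming only the Polish letters matter. The module name Latin2Sketch and its partial table are hypothetical, not part of Exml or xmerl, and a complete converter would need the full ISO-8859-2 table or a dedicated library.

```elixir
defmodule Latin2Sketch do
  # Partial ISO-8859-2 -> Unicode table (an assumption for illustration):
  # only the Polish letters whose byte values mean something else in
  # ISO-8859-1. Ó and ó share their values with ISO-8859-1 and need no entry.
  @polish %{
    0xA1 => 0x0104, 0xB1 => 0x0105,  # Ą ą
    0xC6 => 0x0106, 0xE6 => 0x0107,  # Ć ć
    0xCA => 0x0118, 0xEA => 0x0119,  # Ę ę
    0xA3 => 0x0141, 0xB3 => 0x0142,  # Ł ł
    0xD1 => 0x0143, 0xF1 => 0x0144,  # Ń ń
    0xA6 => 0x015A, 0xB6 => 0x015B,  # Ś ś
    0xAC => 0x0179, 0xBC => 0x017A,  # Ź ź
    0xAF => 0x017B, 0xBF => 0x017C   # Ż ż
  }

  # Takes a list of ISO-8859-2 bytes (such as the body returned by :httpc)
  # and returns a UTF-8 binary. Bytes without a table entry pass through as
  # the code point of the same number.
  def to_utf8(bytes) when is_list(bytes) do
    bytes
    |> Enum.map(fn byte -> Map.get(@polish, byte, byte) end)
    |> List.to_string()
  end
end

# "złoty" as ISO-8859-2 bytes: z, ł (0xB3), o, t, y
IO.puts(Latin2Sketch.to_utf8([122, 0xB3, 111, 116, 121]))
```

The result is a UTF-8 binary, so it would still be handed to Exml.parse with encoding: :"utf-8" as shown above.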
