Koziołek

Reputation: 2874

Bad character code when parsing XML in exml

I have a simple Elixir application:

defmodule Interop do
  @moduledoc false
  def main(args) do
    Application.ensure_all_started :inets
    resp = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
    handle_response(resp)
  end

  defp handle_response({:ok, resp}) do
    {{_, _, _}, _headers, body} = resp
    doc = Exml.parse ~s/#{body}/
  end

  defp handle_response({:error, resp}) do
    IO.puts resp
  end

end

When I run it, I get:

** (exit) {:bad_character_code, [60, 116, 97, 98, 101, 108, 97, 95, 107, 117, 114, 115, 111, 119, 32, 116, 121, 112, 61, 34, 65, 34, 32, 117, 105, 100, 61, 34, 49, 54, 97, 49, 56, 54, 34, 62, 13, 10, 32, 32, 32, 60, 110, 117, 109, 101, 114, 95, ...], :"iso-8859-2"}
    xmerl_ucs.erl:511: :xmerl_ucs.to_unicode/2
    xmerl_scan.erl:709: :xmerl_scan.scan_prolog/4
    xmerl_scan.erl:565: :xmerl_scan.scan_document/2
    xmerl_scan.erl:288: :xmerl_scan.string/2
    lib/exml.ex:11: Exml.parse/2
    (elixir) lib/kernel/cli.ex:76: anonymous fn/3 in Kernel.CLI.exec_fun/2

When I download the file manually and try to parse it, I get the same error.

My question is: where is the mistake in my code? I suspect, though it is only a feeling, that the problem lies in doc = Exml.parse ~s/#{body}/ and the encoding of the document. Any suggestions?

Upvotes: 1

Views: 441

Answers (2)

Aleksei Matiushkin

Reputation: 121000

If you open the downloaded file in a text editor of your choice, you'll see that it is encoded as ISO-8859-2. The ~s sigil then converts it to UTF-8, but that conversion treats the input as Latin-1 (ISO-8859-1), which it is not.

Exml.parse expects binary input, so the charlist returned by :httpc cannot be passed to it directly.

The easiest fix is to use codepagex:

doc = body
      |> :erlang.list_to_binary
      |> Codepagex.to_string!(:iso_8859_2)
      |> Exml.parse(encoding: 'utf-8') # force utf-8
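
To see why the pipeline above starts from the raw bytes instead of an interpolated string, here is a minimal sketch that uses only the standard library. The sample byte 179 (0xB3) and the variable names are chosen purely for illustration: that byte is ł in ISO-8859-2 but ³ in ISO-8859-1.

```elixir
# Byte 0xB3 is "ł" in ISO-8859-2 but "³" in ISO-8859-1 (Latin-1).
body = [179]

# Interpolation treats each integer as a Unicode codepoint (U+00B3),
# so the result is the UTF-8 encoding of "³", not of "ł".
interpolated = "#{body}"

# list_to_binary keeps the original byte untouched, which is what an
# ISO-8859-2 decoder needs as its input.
raw = :erlang.list_to_binary(body)

IO.inspect(interpolated == <<194, 179>>) # true
IO.inspect(raw == <<179>>)               # true
```

So any character that differs between the two code pages would come out wrong after interpolation, even though the resulting binary is valid UTF-8.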

If the encoding is not known in advance, extract it from the encoding="ISO-8859-2" attribute in the XML declaration and pass the matched value to Codepagex.to_string!:

xml = body |> IO.chardata_to_string
[[encoding]] = Regex.scan(~r/(?<=encoding=").*?(?=")/, xml)
doc = xml
      |> Codepagex.from_string!(:iso_8859_1)
      |> Codepagex.to_string!(
           encoding
           |> String.downcase
           |> String.replace("-", "_")
           |> String.to_atom
         )
      |> Exml.parse(encoding: 'utf-8') # force utf-8
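
Note that the [[encoding]] match above raises a MatchError when the document has no encoding attribute. A slightly more defensive lookup might look like the sketch below. The EncodingSketch module and declared_encoding/2 are hypothetical names, not part of Exml or Codepagex, and the "utf_8" fallback is an assumption based on UTF-8 being the XML default when no encoding is declared.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper, not part of Exml or Codepagex.
  # Returns the encoding named in the XML declaration, normalised to a
  # lower-case, underscore-separated string such as "iso_8859_2".
  # Falls back to `default` when no encoding attribute is found.
  def declared_encoding(xml, default \\ "utf_8") when is_binary(xml) do
    case Regex.run(~r/<\?xml[^>]*encoding=["']([^"']+)["']/, xml) do
      [_, name] -> name |> String.downcase() |> String.replace("-", "_")
      nil -> default
    end
  end
end

xml = ~s(<?xml version="1.0" encoding="ISO-8859-2"?>\r\n<tabela_kursow/>)
IO.inspect(EncodingSketch.declared_encoding(xml))
```

The result is still a string, so it would need the same String.to_atom step as in the snippet above before being handed to Codepagex.to_string!.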

Upvotes: 0

Dogbert

Reputation: 222278

The problem is that ~s/#{body}/ re-encodes body as UTF-8: it treats the list of integers as a list of Unicode codepoints rather than as ISO-8859-2 bytes:

iex(1)> ~s/#{[241]}/ <> <<0>>
<<195, 177, 0>>

while the XML file explicitly says that it's encoded as ISO-8859-2:

$ curl -s http://www.nbp.pl/kursy/xml/LastA.xml | head -1
<?xml version="1.0" encoding="ISO-8859-2"?>

Your code works if you force the XML parser to use UTF-8 encoding:

iex(1)> {:ok, {{_, _, _}, _headers, body}} = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
{:ok,
 {{'HTTP/1.1', 200, 'OK'},
  [...],
  [60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46,
   48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34, 73, 83, 79, 45,
   56, 56, 53, 57, 45, 50, 34, 63, 62, 13, 10, 60, 116, ...]}}
iex(2)> "#{body}" |> Exml.parse(encoding: :"utf-8")
{:xmlElement, :tabela_kursow, :tabela_kursow, [], {:xmlNamespace, [], []}, [],
 1,
 [{:xmlAttribute, :typ, [], [], [], [tabela_kursow: 1], 1, [], 'A', false},
  {:xmlAttribute, :uid, [], [], [], [tabela_kursow: 1], 2, [], '16a186',
   false}],
 ...}

Upvotes: 3
