Reputation: 2874
I have a simple Elixir application:
defmodule Interop do
  @moduledoc false

  def main(args) do
    Application.ensure_all_started(:inets)
    resp = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
    handle_response(resp)
  end

  defp handle_response({:ok, resp}) do
    {{_, _, _}, _headers, body} = resp
    doc = Exml.parse(~s/#{body}/)
  end

  defp handle_response({:error, resp}) do
    IO.puts(resp)
  end
end
When I run it, I get:
** (exit) {:bad_character_code, [60, 116, 97, 98, 101, 108, 97, 95, 107, 117, 114, 115, 111, 119, 32, 116, 121, 112, 61, 34, 65, 34, 32, 117, 105, 100, 61, 34, 49, 54, 97, 49, 56, 54, 34, 62, 13, 10, 32, 32, 32, 60, 110, 117, 109, 101, 114, 95, ...], :"iso-8859-2"}
xmerl_ucs.erl:511: :xmerl_ucs.to_unicode/2
xmerl_scan.erl:709: :xmerl_scan.scan_prolog/4
xmerl_scan.erl:565: :xmerl_scan.scan_document/2
xmerl_scan.erl:288: :xmerl_scan.string/2
lib/exml.ex:11: Exml.parse/2
(elixir) lib/kernel/cli.ex:76: anonymous fn/3 in Kernel.CLI.exec_fun/2
When I download the file manually and try to parse it, I have the same issue.
My question is: where is the mistake in my code? It's only a feeling, but I think the problem is with doc = Exml.parse ~s/#{body}/
and the encoding of the document. Any suggestions?
Upvotes: 1
Views: 441
Reputation: 121000
If you open the downloaded file in the text editor of your choice, you'll see it's in ISO-8859-2 encoding; it then gets converted to UTF-8 by the ~s sigil. The latter conversion assumes that the input is Latin-1, aka ISO-8859-1, which is not the case here.
Exml.parse expects binary input, so there is no way to pass the received char list to it directly.
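The two ways of turning a char list into a binary behave differently, which is the crux here: :erlang.list_to_binary/1 keeps the raw bytes, while interpolation (which goes through List.to_string/1) UTF-8-encodes each integer as a Unicode codepoint. A quick illustration (241 is "ñ" in ISO-8859-1/2):

```elixir
# Raw bytes are preserved — still valid ISO-8859-2 data:
:erlang.list_to_binary([241])  # => <<241>>

# Interpolation treats 241 as a codepoint and UTF-8-encodes it:
List.to_string([241])          # => "ñ", i.e. <<195, 177>>
```
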
The easiest way to fix the issue would be to use codepagex:
doc =
  body
  |> :erlang.list_to_binary()
  |> Codepagex.to_string!(:iso_8859_2)
  |> Exml.parse(encoding: 'utf-8') # force utf-8
When arbitrary encodings might be received, the body should be parsed for the encoding="ISO-8859-2" declaration, and the matched value used as the parameter in the call to Codepagex.to_string!:
xml = body |> IO.chardata_to_string()
[[encoding]] = Regex.scan(~r/(?<=encoding=").*?(?=")/, xml)

doc =
  xml
  |> Codepagex.from_string!(:iso_8859_1)
  |> Codepagex.to_string!(
    encoding
    |> String.downcase()
    |> String.replace("-", "_")
    |> String.to_atom()
  )
  |> Exml.parse(encoding: 'utf-8') # force utf-8
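The detection steps above can be wrapped in a small helper. The mapping from the declared name to a Codepagex alias (downcase, dashes to underscores) is the same assumption as in the snippet; String.to_existing_atom/1 is used instead of String.to_atom/1 so that untrusted input cannot mint new atoms (the NBP module name and to_utf8 function are illustrative only):

```elixir
defmodule NBP do
  # Convert a charlist HTTP body whose XML prolog declares its own
  # encoding (e.g. encoding="ISO-8859-2") into a UTF-8 binary.
  def to_utf8(body) do
    xml = IO.chardata_to_string(body)

    # Pull the declared encoding out of the XML prolog.
    [[encoding]] = Regex.scan(~r/(?<=encoding=").*?(?=")/, xml)

    # Map "ISO-8859-2" to the :iso_8859_2 alias used by Codepagex;
    # to_existing_atom refuses names Codepagex has not already defined.
    alias_name =
      encoding
      |> String.downcase()
      |> String.replace("-", "_")
      |> String.to_existing_atom()

    xml
    |> Codepagex.from_string!(:iso_8859_1)
    |> Codepagex.to_string!(alias_name)
  end
end
```
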
Upvotes: 0
Reputation: 222278
The problem is that ~s/#{body}/ actually changes the encoding of body from ISO-8859 to UTF-8, as it treats a list of integers as a list of Unicode codepoints:
iex(1)> ~s/#{[241]}/ <> <<0>>
<<195, 177, 0>>
while the XML file explicitly says that it's encoded as ISO-8859-2:
$ curl -s http://www.nbp.pl/kursy/xml/LastA.xml | head -1
<?xml version="1.0" encoding="ISO-8859-2"?>
Your code works if you force the XML parser to use UTF-8 encoding:
iex(1)> {:ok, {{_, _, _}, _headers, body}} = :httpc.request(:get, {'http://www.nbp.pl/kursy/xml/LastA.xml', []}, [], [])
{:ok,
{{'HTTP/1.1', 200, 'OK'},
[...],
[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46,
48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34, 73, 83, 79, 45,
56, 56, 53, 57, 45, 50, 34, 63, 62, 13, 10, 60, 116, ...]}}
iex(2)> "#{body}" |> Exml.parse(encoding: :"utf-8")
{:xmlElement, :tabela_kursow, :tabela_kursow, [], {:xmlNamespace, [], []}, [],
1,
[{:xmlAttribute, :typ, [], [], [], [tabela_kursow: 1], 1, [], 'A', false},
{:xmlAttribute, :uid, [], [], [], [tabela_kursow: 1], 2, [], '16a186',
false}],
...}
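Once parsed, values can be pulled out with XPath via Exml.get/2 (assuming the API shown in Exml's README; the sample XML below is inlined so the snippet stands alone):

```elixir
# A self-contained stand-in for the downloaded document:
xml = ~s(<?xml version="1.0" encoding="UTF-8"?><tabela_kursow typ="A" uid="16a186"/>)

doc = Exml.parse(xml)

# XPath attribute lookup — Exml.get/2 per the library README
uid = Exml.get(doc, "//tabela_kursow/@uid")
```
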
Upvotes: 3