Reputation: 185
I am having trouble with special characters and charset = iso-8859-1
.
The same code that I use here works fine with UTF-8, so I do not understand what I am doing wrong.
Here is the code:
File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO") ;
System.out.println(elements.toString());
Here is the output:
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>
Here is how it should be:
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>
How I can I fix it?
Upvotes: 2
Views: 1630
Reputation: 10069
Using EscapeMode.xhtml
will give you output without entities.
Try this code
File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
doc.outputSettings().escapeMode(EscapeMode.xhtml);
Elements elements = doc.select("span.DEPUTADO") ;
System.out.println(elements.toString());
Upvotes: 1