Android/ Jsoup: how to fix encoding issues

Question

I'm developing an app to get legislation online and automatically parse and format it to fit the app. The test site I'm using is

http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm

I want to grab all the contents of that URL, parse (maybe clean) them, and put them in a file. I'm using Jsoup; this is the Runnable I use to connect and print the content to a file:

class FetchHtmlRunnable implements Runnable {
        String url;

        FetchHtmlRunnable(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            try {
                Document doc = Jsoup.parse(new URL(url), 10000);
                doc.charset(Charset.forName("windows-1252"));
                Charset charset = doc.charset();

                String htmlString = Jsoup.clean(doc.toString(), new Whitelist());

                Log.d(TAG, "run: HTMLSTRING: " + htmlString);

                String root = context.getFilesDir().toString();
                file = new File(root + File.separator + "law.txt");

                OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file, false), charset);
                out.write(htmlString);
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }

However, even though Chrome tells me the site's encoding is windows-1252, both the log entry and the file are filled with replacement characters (every character with a diacritic, such as í and ã, is lost), and all newlines are lost as well:

Constitui��o Presid�ncia da Rep�blica Casa Civil Subchefia para Assuntos Jur�dicos CONSTITUI��O DA REP�BLICA FEDERATIVA DO BRASIL DE 1988 Vide Emenda Constitucional n� 91, de 2016 Vide Emenda Constitucional n� 106, de 2020 Vide Emenda Constitucional n� 107, de 2020 Emendas Constitucionais Emendas Constitucionais de Revis�o Ato das Disposi��es Constitucionais Transit�rias Atos decorrentes do disposto no � 3� do art. 5� �NDICE TEM�TICO Texto compilado PRE�MBULO N�s, representantes do povo brasileiro, reunidos em Assembl�ia Nacional Constituinte para instituir um Estado Democr�tico, destinado a assegurar o exerc�cio dos direitos sociais e individuais, a liberdade, a seguran�a, o bem-estar, o desenvolvimento, a igualdade e a justi�a como valores supremos de uma sociedade fraterna, pluralista e sem preconceitos, fundada na harmonia social e comprometida

Maybe someone better at web dev can tell me if that's a problem with the webpage itself and how I can work around it, and also how I can keep the newline characters.

Y2020-09 · Accepted Answer

I will get to character sets in Portuguese, Spanish (and Chinese) in just a second. First, though, let me say that the page you are trying to read actually loads its contents using AJAX / JS. I can download AJAX-loaded content using my own library, which is available on the Internet; otherwise, tools like Selenium, Puppeteer, or Splash would be necessary. Character sets aside, how are you downloading the contents of your "Brazilian Constitution" as HTML in the first place? When I try a straight HTML downloader (no script execution), I get a pile of JavaScript without any Portuguese at all, and it looks nothing like the output posted in your question. :)

If you are already downloading the HTML and only have a problem with the character set, read the rest of this answer. If you have been unable to download anything but the AJAX / JavaScript calls, I can post a separate answer that explains how to execute JS / AJAX in one or two lines. (Essentially, what you posted isn't the same output that I'm getting.)

In nearly every case I have seen, if the text is not straight-up ASCII (because it has foreign-language characters), it is (almost) guaranteed to be readable as UTF-8. I translate Spanish news articles and also Chinese news articles, and UTF-8 always works for me. I had one Spanish site that expected an encoding called "iso8859-1", but other than that "Don Quijote de La Mancha" site, UTF-8 works.

To tell you the truth, it has never been an issue for me, because when reading a web page (as opposed to writing one), Java has parsed the text as UTF-8 without any configuration whatsoever. (An InputStreamReader created without a charset argument decodes with the platform default charset, which is UTF-8 on Android.) Here is the "Open Connection" method body from a library I have written:

HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
con.setRequestMethod                        ("GET");
if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
return new BufferedReader                   (new InputStreamReader(con.getInputStream()));
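The reader above is created without a charset argument, so it decodes with whatever the platform default happens to be. The following is a minimal sketch, not part of my library, of passing the charset explicitly instead. It assumes the page really is windows-1252, as Chrome reports in the question; `CharsetDecodeSketch` and `readAll` are hypothetical names, and the byte array stands in for the HTTP response body.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeSketch {

    // Hypothetical helper: decodes the whole stream with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = br.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the server sends windows-1252 bytes.
        Charset cp1252 = Charset.forName("windows-1252");

        // Stand-in for the response body: "Constituição" encoded as windows-1252.
        byte[] raw = "Constitui\u00e7\u00e3o".getBytes(cp1252);

        // Decoding with the matching charset keeps the accented letters.
        String right = readAll(new ByteArrayInputStream(raw), cp1252);

        // Decoding the same bytes as UTF-8 yields U+FFFD replacement characters
        // where the accented letters were.
        String wrong = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);

        System.out.println(right);
        System.out.println(wrong);
    }
}
```

With a real connection, the same idea would be `new InputStreamReader(con.getInputStream(), charset)` in place of the one-argument constructor.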

Here is the method body of a "Scrape Contents" method from my library:

URL url = new URL("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm");
StringBuilder sb = new StringBuilder();
String s;
BufferedReader br = Scrape.openConn(url);
while ((s = br.readLine()) != null) sb.append(s + "\n");
FileRW.writeFile(sb.toString(), "page.html");
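On the question of keeping newlines: `BufferedReader.readLine()` returns each line without its terminator, so a loop like the one above has to append a line break itself. Here is a small self-contained sketch of that behaviour; `LineBreakSketch` and `readKeepingLines` are hypothetical names, and a `StringReader` stands in for the network reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineBreakSketch {

    // Hypothetical helper: reads every line and re-appends '\n' after each one,
    // because readLine() drops the original terminator (\n, \r, or \r\n).
    static String readKeepingLines(Reader source) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed terminators in, uniform '\n' terminators out.
        String text = readKeepingLines(new StringReader("Art. 1\r\nArt. 2\nArt. 3"));
        System.out.print(text);
    }
}
```

Note that this only preserves the line breaks present in the raw HTML source; it says nothing about what happens once the markup is later stripped or cleaned.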

To be fully honest with you, I don't know the first thing about Microsoft character sets. I have coded on UNIX, and I have never worried about character sets, other than making sure that when writing HTML (as opposed to reading HTML) an HTML element declaring the character set is inserted into my pages.
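Rather than guessing the encoding, one option is to look at what the server declares. The sketch below is an assumption on my part, not something I have tested against this site: it pulls the `charset` parameter out of a Content-Type header value (the string that `URLConnection.getContentType()` returns) and falls back to a default when none is given. `ContentTypeCharsetSketch` and `charsetFromContentType` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharsetSketch {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=windows-1252", or the fallback if the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1", utf8));
        System.out.println(charsetFromContentType("text/html", utf8));
    }
}
```

The result can then be handed to the `InputStreamReader` constructor that takes a `Charset`. A page may also declare its encoding only inside the HTML itself, which this sketch does not inspect.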
