Martin

Reputation: 3

UTF-8 encoding in Java: retrieving data from a website

I'm trying to get data from a website that is encoded in UTF-8 and insert it into a MySQL database. The database is also encoded in UTF-8.

This is the method I use to download data from a specific site:

public String download(String url) throws java.io.IOException {
        java.io.InputStream s = null;
        java.io.InputStreamReader r = null;
        StringBuilder content = new StringBuilder();
        try {
            s = (java.io.InputStream)new URL(url).getContent();

            r = new java.io.InputStreamReader(s, "UTF-8");

            char[] buffer = new char[4*1024];
            int n = 0;
            while (n >= 0) {
                n = r.read(buffer, 0, buffer.length);
                if (n > 0) {
                    content.append(buffer, 0, n);
                }
            }
        }
        finally {
            if (r != null) r.close();
            if (s != null) s.close(); 
        }
        return content.toString();
    }

If the encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8");), the data inserted into the database seems to look OK, but when I try to display it I get something like this: C�te d'Ivoire instead of Côte d'Ivoire.

All my websites are encoded in UTF-8.

Please help.

If the encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252");), everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does this mean?

Upvotes: 0

Views: 19915

Answers (4)

BalusC

Reputation: 1108537

If the encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8");), the data inserted into the database seems to look OK, but when I try to display it I get something like this: C�te d'Ivoire instead of Côte d'Ivoire.

So the encoding used at display time is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, you need to take two things into account:

  1. Write the data to the HTTP response output using the same encoding, i.e. UTF-8.
  2. Set the charset of the content type to UTF-8 so that the web browser knows which encoding to use to display the text.

As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
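
To see why a mismatch between the encoding used for writing and the one used for reading produces exactly the symptoms in the question, here is a minimal Java sketch. The class and method names (MojibakeDemo, roundTrip) are made up for illustration; only the standard java.nio.charset API is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text with one charset, then decode the bytes with another.
    // This mimics a writer and a reader that disagree on the encoding.
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String original = "C\u00f4te d'Ivoire"; // \u00f4 is the letter o with circumflex
        Charset cp1252 = Charset.forName("windows-1252");

        // UTF-8 bytes read as windows-1252: the two bytes 0xC3 0xB4
        // are decoded as two separate characters (U+00C3 and U+00B4).
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, cp1252));

        // ISO-8859-1 bytes read as UTF-8: the lone byte 0xF4 is not valid
        // UTF-8 here, so the decoder substitutes U+FFFD.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }
}
```

What the console actually prints depends on the console's own encoding, which is one more place where such a mismatch can creep in.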

Upvotes: 1

glmxndr

Reputation: 46556

Java

The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set the HttpServletResponse character encoding to UTF-8.

In a JSP page, or in the doGet or doPost of a servlet, do this before any content is sent to the response:

response.setCharacterEncoding("UTF-8");
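
The effect of that call can be sketched without a servlet container. The example below is only a stand-in, not servlet code: it uses a plain OutputStreamWriter over a byte buffer in place of the response writer, and the names ResponseEncodingSketch and writeBody are invented for illustration. It shows that the same text becomes different bytes depending on the encoding the writer uses. When no encoding is set, a servlet response writer typically falls back to ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingSketch {
    // Stand-in for writing text to a response body with a given character encoding.
    public static byte[] writeBody(String text, Charset encoding) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(body, encoding)) {
            out.write(text);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String title = "C\u00f4te d'Ivoire";
        // In ISO-8859-1 the accented letter is the single byte 0xF4.
        System.out.println(writeBody(title, StandardCharsets.ISO_8859_1).length);
        // In UTF-8 the accented letter is the two bytes 0xC3 0xB4.
        System.out.println(writeBody(title, StandardCharsets.UTF_8).length);
    }
}
```

A browser that was told to expect UTF-8 but receives the ISO-8859-1 bytes will show a replacement character for the accented letter.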

PHP

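In PHP, try applying the utf8_encode function to the data after retrieving it from the database. Note that utf8_encode assumes its input is ISO-8859-1, so it only helps if the data comes back from the database as Latin-1.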

Upvotes: 2

Tomas

Reputation: 1725

I would consider using commons-io; it has a function that does what you want: IOUtils.toString.

That is, replace your code with this:

public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    String content = null;
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        // IOUtils is org.apache.commons.io.IOUtils; this reads the whole stream and decodes it as UTF-8
        content = IOUtils.toString(s, "UTF-8");
    }
    finally {
        if (s != null) s.close();
    }
    return content;
}

If that doesn't do it, check whether you can store the content to a file correctly, to eliminate the possibility that your database isn't set up correctly.
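
A rough sketch of such a file check, using only the JDK. The class and method names (FileCheck, saveAndReload, hex) are made up for illustration, and the hard-coded string stands in for the result of download(url). If the text survives the round trip and the hex dump shows c3 b4 for the accented letter, the download step is fine and the problem is further downstream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileCheck {
    // Write the text to a file as UTF-8, then read the file back as UTF-8.
    public static String saveAndReload(String content, Path file) throws IOException {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Hex dump of the UTF-8 bytes, handy for spotting wrong byte sequences.
    public static String hex(String content) {
        StringBuilder sb = new StringBuilder();
        for (byte b : content.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String content = "C\u00f4te d'Ivoire"; // stand-in for download(url)
        Path file = Files.createTempFile("download-check", ".txt");
        System.out.println(content.equals(saveAndReload(content, file)));
        System.out.println(hex(content));
        Files.delete(file);
    }
}
```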

Upvotes: 7

Confusion

Reputation: 16841

Is your database encoding set to UTF-8 for the server, the client, and the connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'.
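
From the Java side, the connection part of that check might look like the sketch below. It assumes MySQL Connector/J is the JDBC driver; useUnicode and characterEncoding are Connector/J connection properties. The class and method names (MySqlCharsetCheck, utf8Url, charsetVariables) are hypothetical, as are any host and database names you pass in.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

public class MySqlCharsetCheck {
    // Build a Connector/J URL that asks the driver to use UTF-8 on the connection.
    public static String utf8Url(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    // Collect the character_set_* variables as seen by this connection.
    public static Map<String, String> charsetVariables(Connection conn) throws SQLException {
        Map<String, String> vars = new LinkedHashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'character_set%'")) {
            while (rs.next()) {
                // Column 1 is the variable name, column 2 its value.
                vars.put(rs.getString(1), rs.getString(2));
            }
        }
        return vars;
    }
}
```

With a driver on the classpath, a connection obtained via DriverManager.getConnection(utf8Url(host, database), user, password) can be passed to charsetVariables to see whether the client, connection, and results character sets are all UTF-8.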

Upvotes: 1
