Reputation: 2459
I crawled a list of movies and stored them in my database. Everything works fine for titles that contain only English characters, but titles that contain non-English characters are not displayed correctly. For example, the Italian movie "Il più crudele dei giorni" is stored as "Il pi&ugrave; crudele dei giorni".
Could someone kindly let me know if there is a solution? (I know that I can set the language for the crawler, and I have already crawled the movie titles in Italian as well. But when I crawl the English titles, some movies on IMDb still have non-English characters in their names.)
EDIT: Here is my code:
    String baseUrl = "http://www.imdb.com/search/title?at=0&count=250&sort=num_votes,desc&start="+start+"&title_type=feature&view=simple";
    label1: try {
        org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").header("Accept-Language", "en");
        con.timeout(30000).ignoreHttpErrors(true).followRedirects(true);
        Response resp = con.execute();
        Document doc = null;
        if (resp.statusCode() == 200) {
            doc = con.get();
            Elements myElements = doc.getElementsByClass("results").first().getElementsByTag("table");
            Elements trs = myElements.select(":not(thead) tr");
            for (int i = 0; i < trs.size(); i++) {
                Element tr = trs.get(i);
                Elements tds = tr.select("td");
                for (int j = 3; j < tds.size(); j++) {
                    Elements links = tds.select("a[href]");
                    String titleId = links.attr("href");
                    String movietitle = links.html();
                    //I ADDED YOUR CODE HERE
                    Charset c = Charset.forName("UTF-16BE");
                    ByteBuffer b = c.encode(movietitle);
                    for (int m = 0; b.hasRemaining(); m++) {
                        int charValue = (b.get()) & 0xff;
                        System.out.print((char) charValue);
                    }
                    // try{
                    //     String query = "INSERT into test (movieName,ImdbId)" + "VALUES (?,?)";
                    //     PreparedStatement preparedStmt = conn.prepareStatement(query);
                    //     preparedStmt.setString (1, movietitle);
                    //     preparedStmt.setString (2, titleId );
                    // }catch (Exception e)
                    // {
                    //     e.printStackTrace();
                    // }
Thanks,
Upvotes: 1
Views: 140
Reputation: 2404
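I copied the string from the question and tried the following:
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class Test {
        public static void main(String... a) throws Exception {
            String s = "Il più crudele dei giorni";
            // Encode the string as UTF-16BE: two bytes per char, high byte first.
            Charset c = Charset.forName("UTF-16BE");
            ByteBuffer b = c.encode(s);
            // Print every byte as its own char. For characters up to U+00FF the
            // high byte is 0 and the low byte equals the code point, so the
            // output also contains a NUL char before each visible character.
            while (b.hasRemaining()) {
                int charValue = b.get() & 0xff;
                System.out.print((char) charValue);
            }
        }
    }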
This prints the string s as it is on the console. I assume that you already have the part of the code that writes to a file. You can try integrating the above code if it works for you.
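A separate observation: the stored value in the question contains `&ugrave;`, which is an HTML character reference and not an encoding error. That suggests the title was read as markup (the question uses `links.html()`), so switching to jsoup's `text()`, which returns unescaped text, is probably the simpler fix, though I have not run it against that page. If the escaped strings are already in the database, they could be decoded afterwards. The following is only a minimal sketch of that idea using the standard library (Java 9 or later); `EntityDecoder` and its small entity table are hypothetical names made up for illustration and cover only a handful of named entities.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or the question's code.
final class EntityDecoder {
    // Assumption: only a few named entities are needed; a real table is much larger.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&",
        "lt", "<",
        "gt", ">",
        "quot", "\"",
        "agrave", "\u00e0",
        "egrave", "\u00e8",
        "ugrave", "\u00f9");

    // Matches &name;, decimal &#249; and hexadecimal &#xF9; references.
    private static final Pattern REF =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Replaces every recognised reference; unknown ones are left unchanged.
    static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = numeric(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = numeric(body.substring(1), 10, m.group());
            } else {
                replacement = NAMED.getOrDefault(body, m.group());
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to text, or returns the original on bad input.
    private static String numeric(String digits, int radix, String original) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp)
                ? new String(Character.toChars(cp))
                : original;
        } catch (NumberFormatException e) {
            return original;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("Il pi&ugrave; crudele dei giorni"));
    }
}
```

Decoding before the `INSERT` (or avoiding the escaping in the first place) keeps the database free of markup, so whatever reads the titles later does not need to know they came from HTML.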
Upvotes: 1