breakdown1986

Reputation: 489

UTF-8 characters not showing properly

I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my site, which is in French. The site used to be encoded in ISO-8859-1.

Currently I have two indexes in Solr. The first stores my old pages (in ISO-8859-1) and the second stores my new pages (in UTF-8).

I use the same Nutch configuration for both crawl jobs to fetch and index the old and the new pages on my site. As far as I know, I have not added any character-encoding settings of my own.

The problem appears when I search the new pages, which are supposed to be in UTF-8: the French characters don't display properly. The old pages, which are in ISO-8859-1, look fine.

I was wondering if anyone could point me in the right direction for fixing this problem.

I believe the problem comes from Nutch, because when I created a dump of the segments I saw the same garbled characters in the dump file.

Thank you.

Upvotes: 1

Views: 2809

Answers (2)

Nikolay Spassov

Reputation: 1316

The "parser.character.encoding.default" property in nutch-default.xml should be set to match the encoding of your pages. Its default value is "windows-1252"; you just have to set it to utf-8.
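To illustrate why a windows-1252 fallback would produce the kind of garbled characters described in the question, here is a small Python sketch. It is not Nutch code and the sample word is made up; it only shows what happens when UTF-8 bytes are decoded with the wrong charset.

```python
# Illustration only: this is not Nutch code, and the sample text is made up.
sample = "français"

# The bytes a UTF-8 page would actually contain.
utf8_bytes = sample.encode("utf-8")

# What a parser produces if it wrongly assumes windows-1252 for those bytes.
garbled = utf8_bytes.decode("windows-1252")

# Decoding with the correct charset recovers the original text.
correct = utf8_bytes.decode("utf-8")

print(garbled)
print(correct)
```

If the text in your segment dump looks like the first printed line (each accented letter replaced by two odd characters), a wrong default encoding is a likely cause.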

Upvotes: 3

Adam Gent

Reputation: 49085

I'm not as familiar with Nutch, but I have seen this with other tools.

A couple of things you should check or do:

  1. Your web server may not be declaring UTF-8 in the Content-Type header it sends for the new pages
  2. The charset meta tags in the new pages may still say ISO-8859-1
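Both checks can be sketched in Python. The helper names below (charset_from_header, charset_from_meta) are hypothetical, and the sample header and HTML are invented; you would feed in the real header value and the first bytes of a page fetched from your own server.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Matches both <meta charset="..."> and the older http-equiv form.
META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)",
                     re.IGNORECASE)


def charset_from_meta(html_bytes):
    """Return the charset declared in a meta tag near the top of the page, or None."""
    match = META_RE.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical sample values, standing in for a real response.
header = "text/html; charset=UTF-8"
page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'

print(charset_from_header(header))
print(charset_from_meta(page))
```

If the two values disagree, or either one is missing, a crawler may have to guess the encoding.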

What I recommend is to take all the old pages of your site and convert them to UTF-8 with a tool like iconv. Then configure your web server so that all text is served as UTF-8 (that is, the Content-Type header it sends back says UTF-8).
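If iconv is not at hand, the same conversion can be sketched in Python. The function name latin1_to_utf8 and the sample text are hypothetical, and the sketch works on bytes in memory; reading and writing your actual files is left out.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical old page content, stored as ISO-8859-1 bytes.
old_page = "café".encode("iso-8859-1")
new_page = latin1_to_utf8(old_page)

print(new_page.decode("utf-8"))
```

Any charset meta tag inside the converted pages would still need to be updated to say UTF-8, since the conversion only changes the bytes.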

Upvotes: 0
