Reputation: 489
I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my site, which is in French. My site used to be in ISO-8859-1.
Currently I have 2 indexes under Solr. In the first one I store my old pages (in ISO-8859-1) and in the second one I store my new pages (in UTF-8).
I use the same Nutch configuration for both crawl jobs to fetch and index the old and the new pages on my site. I have not added any settings about character encodings on my own (I think).
I am facing a problem when searching the new pages, which are supposed to be in UTF-8: the French characters don't display properly. For the old pages, which are in ISO-8859-1, everything seems to be fine.
I was wondering if anyone could point me in the right direction for fixing this problem.
I believe the problem comes from Nutch, since when I created a dump of the segments I saw the same garbled characters in the dump file.
Thank you.
Upvotes: 1
Views: 2809
Reputation: 1316
In nutch-default.xml, the "parser.character.encoding.default" property should be set to match your pages. Its default value is "windows-1252"; you just have to set it to "utf-8".
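To illustrate why that default would produce the symptoms in the question, here is a minimal Python sketch (not Nutch code) that decodes the same French text with the right and the wrong charset. The sample string and the `decode_as` helper are hypothetical, made up for the demonstration. It assumes the new pages are being decoded as windows-1252 because the parser falls back to that default.

```python
# Sketch only: reproduces the garbling outside Nutch.
# The sample text and decode_as() are illustrative assumptions.

def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw page bytes using the given charset name."""
    return raw.decode(encoding)

sample = "Données en français"

# A new page: stored as UTF-8, but decoded with the windows-1252 fallback.
utf8_bytes = sample.encode("utf-8")
garbled = decode_as(utf8_bytes, "windows-1252")   # each accented letter becomes two characters
correct = decode_as(utf8_bytes, "utf-8")          # round-trips cleanly

# An old page: stored as ISO-8859-1, decoded with the same fallback.
latin1_bytes = sample.encode("iso-8859-1")
old_page = decode_as(latin1_bytes, "windows-1252")

print(garbled)
print(correct)
print(old_page)
```

For accented French letters, windows-1252 and ISO-8859-1 use the same byte values, which would explain why the old pages look fine under the default while the UTF-8 pages do not.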
Upvotes: 3
Reputation: 49085
I'm not that familiar with Nutch, but I have seen this with other tools.
A couple of things you should check or do:
What I recommend is to take all the pages of your old site and use a tool like iconv to convert them to UTF-8. Then configure your web server so that all text is served as UTF-8 (that is, the Content-Type header sent back declares charset=UTF-8).
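If iconv is not at hand, the conversion step can be sketched in a few lines of Python. This is only an illustration: the `convert_to_utf8` name and the file paths are hypothetical, and it assumes every source file really is ISO-8859-1 (bytes are decoded without any detection).

```python
from pathlib import Path

def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode one file from source_encoding to UTF-8.

    Assumes the whole file is in source_encoding; no charset detection is done.
    """
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))

# Hypothetical usage: convert every .html file under old_site/ into utf8_site/.
# for page in Path("old_site").rglob("*.html"):
#     target = Path("utf8_site") / page.relative_to("old_site")
#     target.parent.mkdir(parents=True, exist_ok=True)
#     convert_to_utf8(page, target)
```

Writing to a separate output path keeps the originals intact, so a wrongly assumed source encoding can be corrected and the conversion rerun.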
Upvotes: 0