eljane

Reputation: 1

Language Detection in Solr for Nutch documents

How can I use Solr to identify the language of documents obtained by crawling with Nutch?

I installed Nutch 1.9 and Solr 4.8.1. I added a new core named "core-test" to Solr through Core Admin on the Solr Admin page, and I followed the steps in the Solr wiki for language detection during document indexing.

I modified the schema.xml in core-test/conf by adding the field

<field name="language_s" type="string" stored="true" indexed="true"/>

Then I used Nutch to crawl a set of web pages with

crawl seed.txt Test http://localhost:8983/solr/core-test 2

Nutch runs without errors, but the language of the documents is not identified: the field language_s does not appear when I run a query at http://localhost:8983/solr/#/core-test/query with q set to "*:*".

Upvotes: 0

Views: 784

Answers (1)

aalbahem

Reputation: 782

You need to enable language detection in Nutch. Add the property below to Nutch_HOME/conf/nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

The property above enables the language-identifier plugin bundled with Nutch. As described in the Nutch wiki, the plugin adds a field named "lang" that contains the language code of each document.
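To check whether the "lang" field actually reaches Solr after re-crawling, one option is to query the core directly instead of using the Admin UI. The following is only a minimal sketch: it assumes the Solr URL and core name from the question, assumes documents have an "id" field, and the helper names (build_lang_query_url, extract_langs, fetch_langs) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Solr URL and core name as given in the question.
SOLR_CORE_URL = "http://localhost:8983/solr/core-test"


def build_lang_query_url(core_url=SOLR_CORE_URL, rows=10):
    """Build a select URL that matches all documents and returns only id and lang."""
    params = {"q": "*:*", "fl": "id,lang", "rows": rows, "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def extract_langs(solr_response):
    """Map each document id to its lang value (None when the field is missing)."""
    docs = solr_response.get("response", {}).get("docs", [])
    return {doc.get("id"): doc.get("lang") for doc in docs}


def fetch_langs(core_url=SOLR_CORE_URL, rows=10):
    """Query a running Solr instance and return {id: lang}; needs Solr to be up."""
    with urlopen(build_lang_query_url(core_url, rows)) as resp:
        return extract_langs(json.load(resp))
```

Calling fetch_langs() against the running core should return a dictionary of document ids and language codes. If every value is None, it may be worth checking that the core's schema.xml defines a field named "lang" (rather than only language_s) and that the pages were re-indexed after the plugin was enabled.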

Upvotes: 2
