Reputation: 1
How can I use Solr for language identification of documents crawled with Nutch?
I installed Nutch 1.9 and Solr 4.8.1.
I added a new core named "core-test" to Solr through Core Admin on the Solr Admin page, and I followed the steps in the Solr wiki for language detection during document indexing.
I modified schema.xml in core-test/conf by adding the field:
<field name="language_s" type="string" stored="true" indexed="true"/>
Then I used Nutch to crawl a set of web pages with:
crawl seed.txt Test http://localhost:8983/solr/core-test 2
Nutch runs without problems, but the language of the documents is not identified: the field language_s is missing from the results when I run a query at http://localhost:8983/solr/#/core-test/query with q set to "*:*".
Upvotes: 0
Views: 784
Reputation: 782
You need to enable language detection in Nutch itself. Copy the XML property below into Nutch_HOME/conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
The property above enables the language-identifier plugin bundled with Nutch. As described in Nutch's wiki, the plugin adds a field named "lang" that contains the language code of each document.
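Since the detected language arrives in "lang" rather than "language_s", the Solr core probably needs a matching field declaration if its schema.xml is not the one shipped with Nutch. A minimal sketch, assuming core-test's schema already defines a "string" field type (the attributes simply mirror the language_s field from the question):

```xml
<!-- Assumed addition to core-test/conf/schema.xml: receives the language
     code written by Nutch's language-identifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>
```

After reloading the core and re-running the crawl, the "lang" field should show up in the query results if the plugin is active.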
Upvotes: 2