Reputation: 4260
I am using Nutch 2.0 to crawl some websites, but HTML meta tags such as title and description are not being extracted and stored in the MySQL database. Any idea how I can get this to work?
Thanks Arash
Upvotes: 1
Views: 4224
Reputation: 1255
Take a look at the latest patches for Nutch 2.x.
Although I can store metadata in the database, I can't figure out how to transfer it to Solr.
Upvotes: 0
Reputation: 764
Index-Metatags plugin is not included in the 2.x series. Please check http://wiki.apache.org/nutch/Nutch2Plugins for more information. There is a patch in there that makes the plugin work for 2.x series.
1.6 is the stable version for Nutch right now as the above author has pointed out in the comment.
Upvotes: 0
Reputation: 1114
Make sure to include the parse-metatags and index-metadata plugins in your plugin.includes definition in nutch-site.xml. Then add the metatags.names, index.parse.md and index.content.md properties and point them to the appropriate tags. Take a look at mine:
<property>
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>metatags.names</name>
<value>*</value>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.author,metatag.twitter:image</value>
</property>
<property>
<name>index.content.md</name>
<value>author,description,twitter:image</value>
</property>
Test your configuration. I ran this test against an article on readwrite.com:
bin/nutch indexchecker http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
The output will tell you if you're parsing the correct values. In my case I wanted author, description and twitter:image:
fetching: http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
parsing: http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
contentType: text/html
content : What's Really Behind China's Attacks On Apple And Android? – ReadWrite Sections Sections Social Mobi
title : What's Really Behind China's Attacks On Apple And Android? – ReadWrite
host : readwrite.com
metatag.author : Brian S Hall
tstamp : Wed Mar 20 13:33:38 EDT 2013
metatag.twitter:image : http://readwrite.com/files/styles/150_150sc/public/fields/China%20graphic%20brian%20final.jpg
metatag.description : Repeated outbursts suggest China could be growing concerned over America's dominance in the smartpho
url : http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
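If you want to automate that check, here is a minimal sketch in Python. It is not part of Nutch; it only assumes the indexchecker output has the "field : value" layout pasted above, and the helper names (parse_fields, missing_fields) and the extra metatag.keywords field are made up for illustration.

```python
import re


def parse_fields(output):
    """Turn 'field : value' lines into a dict.

    Assumption: each line has the layout shown in the pasted output,
    i.e. a field name, a colon followed by whitespace, then the value.
    """
    fields = {}
    for line in output.splitlines():
        # Split on the first colon that is followed by whitespace, so that
        # field names containing a colon (metatag.twitter:image) stay intact.
        parts = re.split(r":\s", line, maxsplit=1)
        if len(parts) == 2:
            fields[parts[0].strip()] = parts[1].strip()
    return fields


def missing_fields(output, expected):
    """Return the expected field names that are absent or empty."""
    fields = parse_fields(output)
    return [name for name in expected if not fields.get(name)]


# Excerpt of the output above; metatag.keywords is a hypothetical extra field
# that the page does not provide, added only to show a failing check.
SAMPLE_OUTPUT = """\
contentType: text/html
host : readwrite.com
metatag.author : Brian S Hall
metatag.twitter:image : http://readwrite.com/files/styles/150_150sc/public/fields/China%20graphic%20brian%20final.jpg
metatag.description : Repeated outbursts suggest China could be growing concerned over America's dominance in the smartpho
"""

EXPECTED = [
    "metatag.author",
    "metatag.description",
    "metatag.twitter:image",
    "metatag.keywords",
]

if __name__ == "__main__":
    print(missing_fields(SAMPLE_OUTPUT, EXPECTED))
```

In practice you would feed it the captured output of the indexchecker command instead of SAMPLE_OUTPUT.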
A downside is that parse-metatags will only match tags by name and not by property. For example, <meta name="foo" content="bar"> is fine, while an Open Graph tag like <meta property="og:image" content="http://readwrite.com/sample.jpg" /> will be missed.
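To make the name vs. property difference concrete, here is a small standalone sketch using Python's html.parser. It is not the plugin's code and does not claim to mirror how parse-metatags is implemented; the MetaCollector class and the include_property switch are hypothetical, only there to show what a name-only match picks up compared with one that also falls back to property.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by `name`, optionally falling back to `property`."""

    def __init__(self, include_property=False):
        super().__init__()
        self.include_property = include_property
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name")
        if key is None and self.include_property:
            # Open Graph tags carry their key in `property`, not `name`.
            key = attr.get("property")
        if key and attr.get("content") is not None:
            self.tags[key.lower()] = attr["content"]


def collect_meta(html, include_property=False):
    parser = MetaCollector(include_property)
    parser.feed(html)
    parser.close()
    return parser.tags


# The two example tags from the paragraph above.
SAMPLE = (
    '<meta name="foo" content="bar">'
    '<meta property="og:image" content="http://readwrite.com/sample.jpg" />'
)

if __name__ == "__main__":
    print(collect_meta(SAMPLE))                         # name-only match
    print(collect_meta(SAMPLE, include_property=True))  # name or property
```

With name-only matching the og:image tag is dropped, which is the behaviour described above; getting such tags into the index would need a different or patched parser.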
Upvotes: 5
Reputation: 9049
Take a look at the IndexMetaTags plugin for Nutch, available from version 1.5 onwards.
This will allow you to specify which meta tags to parse and index.
Note: The names of the fields must be prefixed with 'metatags.'
You can check the index using the Nutch indexchecker.
Upvotes: 2