Reputation: 4260

Extracting HTML meta tags in Nutch 2.x and having Solr 4 index it

I am using Nutch 2.0 to crawl some websites, but HTML meta tags such as title and description are not being extracted and stored in the MySQL database. Any idea how I can get this to work?

Thanks Arash

Upvotes: 1

Answers (4)

chris

Reputation: 1255

Take a look at the latest patches for Nutch 2.x.
Although I can store the metadata in the database, I can't figure out how to transfer it to Solr.

Upvotes: 0

kich

Reputation: 764

The index-metatags plugin is not included in the 2.x series. Please check http://wiki.apache.org/nutch/Nutch2Plugins for more information. There is a patch there that makes the plugin work with the 2.x series.

Nutch 1.6 is the current stable version, as the above author pointed out in the comment.

Upvotes: 0

Butifarra

Reputation: 1114

Make sure to include the parse-metatags and index-metadata plugins in your plugin.includes definition in nutch-site.xml.

Then add the metatags.names, index.parse.md, and index.content.md properties and point them at the tags you want. Take a look at mine:

<property>
        <name>plugin.includes</name>
        <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
        <name>metatags.names</name>
        <value>*</value>
</property>
<property>
        <name>index.parse.md</name>
        <value>metatag.description,metatag.author,metatag.twitter:image</value>
</property>
<property>
        <name>index.content.md</name>
        <value>author,description,twitter:image</value>
</property>

Test your configuration. I ran this test against an article on readwrite.com:

bin/nutch indexchecker http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android

The output will tell you if you're parsing the correct values. In my case I wanted author, description and twitter:image:

fetching: http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
parsing: http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
contentType: text/html
content :   What's Really Behind China's Attacks On Apple And Android? – ReadWrite Sections Sections Social Mobi
title : What's Really Behind China's Attacks On Apple And Android? – ReadWrite
host :  readwrite.com
metatag.author :    Brian S Hall
tstamp :    Wed Mar 20 13:33:38 EDT 2013
metatag.twitter:image : http://readwrite.com/files/styles/150_150sc/public/fields/China%20graphic%20brian%20final.jpg
metatag.description :   Repeated outbursts suggest China could be growing concerned over America's dominance in the smartpho
url :   http://readwrite.com/2013/03/20/whats-behind-china-attacks-on-apple-and-android
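If you want to check this output automatically rather than by eye, here is a minimal Python sketch. It is not part of Nutch; it assumes the output keeps the "name : value" layout shown above, with whitespace before the colon on field lines, and the function names are my own.

```python
import re

# Assumption: field lines look like "name : value" (whitespace before the colon),
# while status lines such as "fetching: ..." have no space before the colon.
FIELD_LINE = re.compile(r"^(\S+)\s+:\s*(.*)$")


def parse_indexchecker_output(text):
    """Return a dict of field name -> value from indexchecker-style output."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def missing_fields(text, expected):
    """List the expected field names that do not appear in the output."""
    found = parse_indexchecker_output(text)
    return [name for name in expected if name not in found]


if __name__ == "__main__":
    # A few lines copied from the output above.
    sample = "\n".join([
        "contentType: text/html",
        "host :  readwrite.com",
        "metatag.author :    Brian S Hall",
        "tstamp :    Wed Mar 20 13:33:38 EDT 2013",
    ])
    print(missing_fields(sample, ["metatag.author", "metatag.description"]))
```

Pipe the real indexchecker output into a file and feed it to missing_fields with the fields you listed in index.parse.md.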

A downside is that parse-metatags will only match tags by name and not by property. For example, <meta name="foo" content="bar"> is fine, while an Open Graph tag such as <meta property="og:image" content="http://readwrite.com/sample.jpg" /> will be missed.
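To see which tags on a page fall on which side of that line, here is a rough Python sketch using only the standard library. It is not how parse-metatags is implemented; it just separates meta tags keyed by name from those keyed by property, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values, keyed separately by name and by property."""

    def __init__(self):
        super().__init__()
        self.by_name = {}
        self.by_property = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        name = attributes.get("name")
        prop = attributes.get("property")
        if name:
            self.by_name[name.lower()] = content
        if prop:
            self.by_property[prop.lower()] = content


def collect_meta(html):
    """Return (tags keyed by name, tags keyed by property) for an HTML string."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.by_name, collector.by_property


if __name__ == "__main__":
    page = (
        '<meta name="foo" content="bar">'
        '<meta property="og:image" content="http://readwrite.com/sample.jpg" />'
    )
    by_name, by_property = collect_meta(page)
    print("matched by name:", by_name)
    print("keyed by property only:", by_property)
```

Anything that shows up only in the second dictionary is a tag you should expect to be missing from the index with the configuration above.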

Upvotes: 5

Srikanth Venugopalan

Reputation: 9049

Take a look at the IndexMetaTags plugin for Nutch, available from version 1.5 onwards.

This will allow you to specify which meta tags to parse and index.

Note: The names of the fields must be prefixed with 'metatags.'

You can check what will be indexed using the Nutch indexchecker command.

Upvotes: 2
