Shaggy

Reputation: 97

Nutch crawler not indexing HTML content

I am trying to develop a search functionality where I enter a city name and it gives me the weather conditions for that city.
I have set up Nutch 1.3 and Solr 3.4.0 on my system. I am crawling the website here and passing the index to Solr for searching. Now, when I query for delhi, I want to retrieve the information displayed on this link.

How can I achieve this? Does it require any plugin to be written?

 <doc>
   <float name="score">1.0</float>
   <float name="boost">0.1879294</float>
   <str name="content"/>
   <str name="digest">d41d8cd98f00b204e9800998ecf8427e</str>
   <str name="id">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
   <str name="segment">20111118153543</str>
   <str name="title"/>
   <date name="tstamp">2011-11-18T10:06:45.604Z</date>
   <str name="url">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
 </doc>

Upvotes: 0

Views: 554

Answers (1)

Jayendra

Reputation: 52779

Nutch discovers pages by following the links on pages it has already fetched.
However, the India page has no link that leads to the Delhi page you mention,
so the crawler cannot navigate down to that page.

You can create your own dummy HTML page to act as the start URL for the crawl, and put on it all the links you want Nutch to index.
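
A minimal sketch of such a start page, written as well-formed XHTML. Only the Delhi URL comes from the question; the title and link text are placeholders. It assumes the page is served from a location Nutch can fetch and that the crawl's URL filters accept both this page and the linked pages.

```xml
<!-- Hypothetical seed page: one link per page that Nutch should fetch -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Seed page for the weather crawl</title>
  </head>
  <body>
    <!-- URL taken from the question; add further district pages the same way -->
    <a href="http://www.imd.gov.in/section/nhac/distforecast/delhi.htm">Delhi district forecast</a>
  </body>
</html>
```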

What is the default search field in your schema?
Usually it is the text field, and a query for delhi is matched against that field.
Since *:* returns the Delhi document but delhi does not, the query is not matching any of the indexed tokens in the field it searches.
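
To check which field an unqualified query such as delhi searches, look for the default search field declaration in schema.xml. In Solr 3.x it is typically a single element like the one below; the value text is only the usual default mentioned above, and your schema may name a different field.

```xml
<!-- schema.xml: the field searched when the query does not name one -->
<defaultSearchField>text</defaultSearchField>
```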

What field type is defined for url in the schema?
You can copy that field to another field that uses text analysis, which would produce the delhi token. Querying for url_copy:delhi should then return the result.
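
A sketch of what that could look like in schema.xml, assuming the source field is named url as in the result above. The type name text_url is hypothetical, and the analyzer shown (split on non-letters, then lowercase) is just one choice that turns .../distforecast/delhi.htm into tokens including delhi.

```xml
<!-- Hypothetical field type: breaks a URL into lowercase letter-only tokens -->
<fieldType name="text_url" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Searchable copy of the url field; not stored because url already is -->
<field name="url_copy" type="text_url" indexed="true" stored="false"/>

<!-- Fill url_copy from url at index time -->
<copyField source="url" dest="url_copy"/>
```

The copy is made at index time, so documents indexed before the change need to be reindexed before url_copy:delhi can match them.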

Upvotes: 1
