Gregory Wullimann

Reputation: 567

Indexing only specific domains with Solr and Nutch

I want to crawl a website with Nutch and then index it with Solr.

I have a website with the following structure:

Homepage: example.com

Documents I want to index: subdomain.example.com/{some_number}.html

To "discover" all these documents, I start from example.com/discover, which lists many of the documents I want.

So what I have now is:

In my regex-urlfilter.txt I have set it to crawl only documents from example.com, and this works perfectly.
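For reference, such a filter could look something like the sketch below (the exact patterns are hypothetical; regex-urlfilter.txt applies the first matching `+`/`-` rule to each URL):

```
# accept anything on example.com or its subdomains (hypothetical pattern)
+^https?://([a-z0-9-]+\.)?example\.com/

# reject everything else
-.
```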

I index with Solr and everything works well. I use the following command:

./$nutch/bin/crawl -i -s $nutch/urls/ $nutch/ 5

What I want now is to ONLY index the documents in the format subdomain.example.com/{some_number}.html and ignore everything else (i.e. I don't want to index example.com/discover).

I guess this is done by changing some configuration in Solr, since it's the indexing part.

Upvotes: 0

Views: 428

Answers (1)

Jorge Luis

Reputation: 3253

In this case, the configuration could be done on the Nutch side, filtering the documents before they are sent to Solr.

If you only want to "index" (meaning that you still want to fetch and parse all the links, but send to Solr only the ones that match the regex), you can use the index-jexl-filter plugin. With this plugin, you can write a small JEXL script that checks whether the URL of a document matches your regex; if it does, the document is sent to Solr.

The script could be something like this (configured in your nutch-site.xml file):

url =~ "^https?:\/\/[a-z]+\.example\.com\/(\d+)\.html"
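Wired into nutch-site.xml, the configuration could look roughly like the sketch below. This is a hedged outline, not a drop-in file: check the plugin's documentation for your Nutch version for the exact property name, and note that `...` stands for whatever plugins you already have in `plugin.includes`:

```xml
<!-- enable the plugin: append index-jexl-filter to your existing
     plugin.includes value (shown here as "...") -->
<property>
  <name>plugin.includes</name>
  <value>...|index-jexl-filter</value>
</property>

<!-- JEXL expression evaluated per document; only documents for which
     it returns true are indexed -->
<property>
  <name>index.jexl.filter</name>
  <value>url =~ "^https?:\/\/[a-z]+\.example\.com\/(\d+)\.html"</value>
</property>
```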

If by "index" you actually meant to crawl only the URLs that match your regex (anything that doesn't match will not be fetched or parsed), then you can use the same regex-urlfilter.txt file to define the desired format. Keep in mind that with this approach you would need to run the crawl again.
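Either way, it's worth sanity-checking the pattern before running a crawl. A quick check outside Nutch, e.g. in Python (the sample URLs are made up to mirror the question):

```python
import re

# Same pattern as the JEXL/URL-filter regex above, in Python syntax.
pattern = re.compile(r"^https?://[a-z]+\.example\.com/(\d+)\.html")

# A document URL in the desired format should match...
assert pattern.match("https://subdomain.example.com/1234.html")

# ...while the discovery page should not.
assert not pattern.match("https://example.com/discover")
```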

Upvotes: 3
