Reputation: 567
I want to crawl a website with Nutch and then index it with Solr.
I have a website with the following structure:
Homepage: example.com
Documents I want to index: subdomain.example.com/{some_number}.html
To "discover" all these documents I start from example.com/discover
which has a list of many documents that I want.
So what I have now is:
In my regex-urlfilter.txt I set Nutch to crawl only documents from example.com, and this works perfectly (a sketch of such a rule is shown after the command below).
I index with Solr and everything works well, using the following command:
./$nutch/bin/crawl -i -s $nutch/urls/ $nutch/ 5
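The relevant part of my regex-urlfilter.txt looks roughly like this (a sketch; the exact subdomain pattern may differ):

# accept everything under example.com, including subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.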
What I want now is to ONLY index the documents that are in the format subdomain.example.com/{some_number}.html, ignoring everything else (i.e. I don't want to index example.com/discover).
I guess this is done by changing some configuration in Solr, since Solr handles the indexing part.
Upvotes: 0
Views: 428
Reputation: 3253
In this case, the configuration can be done on the Nutch side, filtering the documents before they're sent to Solr.
If you only want to filter at "index" time (meaning that you want to fetch and parse all the links, but store in Solr only the ones that match the regex), you can use the index-jexl-filter plugin. With this plugin you write a small JEXL script that checks whether the URL of a document matches your regex; only if it does will the document be sent to Solr.
The script, configured in your nutch-site.xml file, could be something like:
url =~ "^https?:\/\/[a-z]+\.example.com\/(\d+).html"
url is a default primitive available in the JEXL context. You can find more info about this in https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1755-L1771
If by "index" you really meant to only crawl the URLs that match your regex (if a URL doesn't match, it will not be fetched or parsed), then you can use the same regex-urlfilter.txt to define the desired format, as sketched below. Keep in mind that with this approach you would need to run the crawl again.
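A sketch of what that regex-urlfilter.txt could look like; rules are tried top to bottom and the first match wins, so the listing page has to stay accepted or Nutch will never fetch it and discover the outlinks:

# keep the listing page crawlable so its outlinks are discovered
+^https?://(www\.)?example\.com/discover$
# accept only the numeric document pages
+^https?://[a-z]+\.example\.com/\d+\.html$
# reject everything else
-.

Note that with this approach the listing page itself is still fetched and would still end up in Solr unless you also filter it out at indexing time (e.g. with the JEXL script above).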
Upvotes: 3