Reputation: 25
I am crawling seed URLs that are a mix of http and https, but for a few of the https URLs I get the error below:
FetcherThread INFO api.HttpRobotRulesParser (168) - Couldn't get robots.txt for https://corporate.douglas.de/investors/?lang=en: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
On the other hand, https://www.integrafin.co.uk/annual-reports/ is crawled without any problem.
Below is my configuration:
plugin.includes protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor|more|static|links)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta|language-identifier
Upvotes: 0
Views: 844
Reputation: 3253
You could try a more recent version of Nutch, or build directly from master, and then try the http.tls.certificates.check setting from https://github.com/apache/nutch/pull/388. It essentially lets you skip the TLS/SSL certificate verification.
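To illustrate what skipping the verification amounts to on the JVM, here is a minimal sketch that builds an SSLContext whose trust manager accepts any certificate chain. This is not Nutch's code, and the class name TrustAllSketch is made up for the example. It only shows why the "PKIX path building failed" error cannot occur once the check is off. Connections made this way are not protected against man-in-the-middle attacks.

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper for illustration only; not part of Nutch.
public class TrustAllSketch {

    /** Builds an SSLContext that performs no certificate validation at all. */
    static SSLContext insecureContext() throws Exception {
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // Accept every client certificate chain.
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // Accept every server certificate chain, so no PKIX path is built.
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } };

        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context;
    }
}
```

A client would then take its socket factory from insecureContext().getSocketFactory() instead of the JVM default.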
Upvotes: 0
Reputation: 407
I think you need to add the certificate of the server https://corporate.douglas.de/investors/?lang=en to the "cacerts" file of the JVM that runs your code.
First, download the certificate using Chrome.
Then click the "Details" tab and then the "Copy to file" button.
In the wizard, select the option "DER binary.... (.CER)".
Now you can use the tool "portecle" (http://portecle.sourceforge.net/) to add the certificate to the cacerts file of your JVM, following these steps: http://portecle.sourceforge.net/import-trusted-cert.html
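If you would rather script the import than use a GUI, the same step can be sketched with the JDK's own KeyStore API. The class name ImportTrustedCert and the command-line arguments are made up for the example. I am also assuming the keystore password is the JDK default "changeit" unless you changed it. Try it on a copy of cacerts first, because the file is rewritten in the JVM's default keystore type.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

// Hypothetical helper for illustration only.
public class ImportTrustedCert {

    /** Adds the X.509 certificate in certFile (DER or PEM) to the keystore under the alias. */
    static void importCert(Path certFile, Path storeFile, char[] password, String alias)
            throws Exception {
        // Parse the exported certificate.
        Certificate cert;
        try (InputStream in = Files.newInputStream(certFile)) {
            cert = CertificateFactory.getInstance("X.509").generateCertificate(in);
        }

        // Load the existing keystore, or start an empty one if the file is missing.
        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        if (Files.exists(storeFile)) {
            try (InputStream in = Files.newInputStream(storeFile)) {
                store.load(in, password);
            }
        } else {
            store.load(null, password);
        }

        // Register the certificate as trusted and write the keystore back.
        store.setCertificateEntry(alias, cert);
        try (OutputStream out = Files.newOutputStream(storeFile)) {
            store.store(out, password);
        }
    }

    // Arguments: <certificate file> <keystore file> <password> <alias>
    public static void main(String[] args) throws Exception {
        importCert(Paths.get(args[0]), Paths.get(args[1]), args[2].toCharArray(), args[3]);
    }
}
```

The JVM that runs the crawler has to be the one whose cacerts file you modified, and it needs a restart to pick up the change.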
Hope this works for you.
Upvotes: 0