sushmita

Reputation: 25

javax.net.ssl.SSLHandshakeException for some HTTPS URLs in Nutch 1.13

I am crawling seed URLs that are a mix of http and https, but for a few of the https URLs I get the error below:

FetcherThread INFO api.HttpRobotRulesParser (168) - Couldn't get robots.txt for https://corporate.douglas.de/investors/?lang=en: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

On the other hand, https://www.integrafin.co.uk/annual-reports/ is crawled without any problem.

Below is my plugin.includes configuration:

protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor|more|static|links)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta|language-identifier

Upvotes: 0

Views: 844

Answers (2)

Jorge Luis

Reputation: 3253

You could try a more recent version of Nutch, or compile directly from master, and then try the http.tls.certificates.check setting introduced in https://github.com/apache/nutch/pull/388. This essentially allows you to skip TLS/SSL certificate verification.
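To make clear what "skipping the TLS/SSL verification" means at the JVM level, here is a minimal sketch of a trust manager that accepts every certificate chain. This is not Nutch's code and I am not claiming the setting is implemented this way; the class and method names (TrustAllSketch, trustAllManager, insecureContext) are made up for illustration. The point is only that the PKIX path building step from the error message is never reached, which also means the crawler can no longer detect a forged or expired certificate.

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration class, not part of Nutch.
public class TrustAllSketch {

    /** Returns a trust manager whose checks never throw, so any chain is accepted. */
    public static X509TrustManager trustAllManager() {
        return new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // Intentionally empty: no validation is performed.
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // Intentionally empty: this is where PKIX validation would normally fail.
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };
    }

    /** Builds an SSLContext that uses the accept-everything trust manager. */
    public static SSLContext insecureContext() throws Exception {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { trustAllManager() }, new SecureRandom());
        return context;
    }
}
```

Connections made through a socket factory from such a context are still encrypted, but the server's identity is not checked, so this is only reasonable for crawling public content.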

Upvotes: 0

JosemyAB

Reputation: 407

I think you need to add the certificate of the server behind https://corporate.douglas.de/investors/?lang=en to the "cacerts" file of the JVM that runs your code.

First, download the certificate using Chrome.

Then click the "Details" tab and then the "Copy to file" button.

In the wizard, select the option "DER binary.... (.CER)"

Now you can use the tool "portecle" (http://portecle.sourceforge.net/) to add the certificate to the cacerts file of your JVM by following these steps: http://portecle.sourceforge.net/import-trusted-cert.html
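If you would rather not use a GUI tool, the same import can be sketched with the standard java.security.KeyStore API. This is only a rough sketch under a few assumptions: the class name TrustStoreImport, the alias, and the command-line argument order are my own choices, and the certificate file is the .CER exported in the previous step. It is safer to run it against a copy of cacerts than against the original.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

// Hypothetical helper class, named for illustration only.
public class TrustStoreImport {

    /** Reads an X.509 certificate (DER or PEM encoded) from the exported file. */
    public static Certificate loadCertificate(Path certFile) throws Exception {
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        try (InputStream in = Files.newInputStream(certFile)) {
            return factory.generateCertificate(in);
        }
    }

    /** Adds the certificate as a trusted entry under the alias and saves the store. */
    public static void importCertificate(Path trustStore, char[] password,
                                         String alias, Certificate cert) throws Exception {
        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        if (Files.exists(trustStore)) {
            // Load the existing entries so they are preserved.
            try (InputStream in = Files.newInputStream(trustStore)) {
                store.load(in, password);
            }
        } else {
            // No file yet: start from an empty store.
            store.load(null, password);
        }
        store.setCertificateEntry(alias, cert);
        try (OutputStream out = Files.newOutputStream(trustStore)) {
            store.store(out, password);
        }
    }

    // Assumed argument order: <certFile> <trustStore> <password> <alias>
    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: TrustStoreImport <certFile> <trustStore> <password> <alias>");
            return;
        }
        Certificate cert = loadCertificate(Paths.get(args[0]));
        importCertificate(Paths.get(args[1]), args[2].toCharArray(), args[3], cert);
        System.out.println("Imported certificate under alias " + args[3]);
    }
}
```

The default password of the JVM's cacerts file is "changeit". On a Java 8 JDK the file lives under $JAVA_HOME/jre/lib/security/cacerts. If you imported into a copy, you can point the JVM at it with the javax.net.ssl.trustStore system property.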

Hope this works for you.

Upvotes: 0
