InternetMAn

Reputation: 3

Scraping A Webpage With JSOUP and Given An SSL Error. Is This A Site Specific Issue? (JSOUP Works On Other Websites)

I'm trying to run a scrape. I run scrapes like this all the time, but this one failed. Normally I use jsoup to connect to a webpage and then grab what I want from the page. This one appears to be attempting an SSL handshake and failing.

I found a question with a similar issue, but I think that OP was having the problem on all jsoup scrapes, whereas mine is specific to this one website: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7 I have tried multiple pages on this site and they all fail the same way. Every other site I have tried has no such issue and scrapes normally.

I tried installing the latest version of Java and restarting the PC; the SSL connection still failed. I also tried downloading the certificate through Firefox, but the pathway described in that answer doesn't seem to exist anymore:

"more info" > "security" > "show certificate" > "details" > "export.."

Since the scraper works just fine on other websites, I think my issue has a different cause, which is why I created this as a separate question instead of commenting on that one.

Here is what happened when I tried to download the cert: instead of "Show Certificate" there is a "View Certificate" option, and it has neither a "Details" tab nor an "Export..." option, so I never get a prompt to save a .cert file.

Am I doing something wrong that is causing the handshake to fail, or is this some sort of functionality that disallows scraping on this website? The page I'm trying to scrape pricing from is: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7

I used jsoup to try to scrape this page. When I googled the resulting error, it seemed to be one people commonly get when trying to connect to servers.

It gave me this error:

Exception in thread "main" javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:732)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:707)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:297)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:286)
    at scrapetestforstack.de.ScrapeTestForStackDe.main(ScrapeTestForStackDe.java:81)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
    at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
    at sun.security.validator.Validator.validate(Validator.java:260)
    at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
    ... 15 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
    at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
    at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
    ... 21 more
C:\Users\LeonardDME\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)

Here is the code I am running:


// Phase 3: scrape the URL for URLs
Document doc = Jsoup.connect(URL).get();

// Sanitize the page title for use as a file name.
// replace() is used rather than replaceAll() because "|" is a regex
// metacharacter, so replaceAll("|", "") would not remove anything.
title = doc.title().replace(" ", "").replace("|", "").replace(";", "");

// Set up the output file
GimmeAName = "C:\\Users\\LeonardDME\\Documents\\NetBeansProjects\\ScrapeTestForStackDe\\Urls\\" + title + ".csv";
File f = new File(GimmeAName);
FileWriter fw = new FileWriter(f);
PrintWriter out = new PrintWriter(fw);
StuffToWrite = URLArray[counter];

// Grab the price spans
Elements spangrabbers = doc.getElementsByClass("art_orginal_preis142790");
for (Element spangrab : spangrabbers) {
    holder2 = spangrab.text();
    SpanHolderArray[SpanHolderCounter] = holder2;
    SpanHolderCounter++;
}

// Get all links on the page
Elements links = doc.select("a[href]");
for (Element link : links) {
    // get the value from the href attribute
    checker = link.attr("href");
    // Skip absolute links, javascript: handlers, and style references
    if (!checker.contains("http") && !checker.contains("javascript") && !checker.contains("style")) {
        counter++;
        LinkContorter = checker; // assumed: LinkContorter was never assigned in the original snippet
        // Note: the original condition used &&, which would throw a
        // NullPointerException when LinkContorter is null; || is what was intended
        if (LinkContorter == null || LinkContorter.isEmpty()) {
            // do nothing
        } else {
            System.out.println(LinkContorter);
            out.print(LinkContorter);
            out.print(",");
            out.print("\n");
            out.flush(); // flush the output to the file
        }
    }
}
System.out.println(counter);

// Close the Print Writer and the File Writer
out.close();
fw.close();

Is it possible that a few of you could try to scrape this site and see if you get the same result as me? I suspect there might be some safeguard against scraping, but I don't want to abandon the task unless I know for sure that's the case. I was also able to scrape this same website a few months ago, in February or March, without an issue.

Upvotes: 0

Views: 839

Answers (2)

dave_thompson_085

Reputation: 38930

Although, as Firefox shows, the cert used by this server does validate using the intermediate CA Sectigo RSA Domain Validation Secure Server CA and the root CA USERTrust RSA Certification Authority, the server sends only the leaf cert, not the intermediate 'chain' cert that standards require it to include.

You can see this in the SSLLabs test report: notice the orange chain warning in the summary and the chain-issues note near the bottom of the certificate details box. Alternatively, if you have (or get) OpenSSL, run openssl s_client -connect www.strack.de:443 -showcerts (many servers today require SNI, and for OpenSSL below 1.1.1 to send SNI you need to add -servername $host, though this server doesn't need it), or, since you have Java, keytool -printcert -sslserver www.strack.de.
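
You can also confirm it from Java itself. Here is a minimal diagnostic sketch (the class name is mine, and the trust-all manager is there only so the handshake completes and we can inspect what was sent; don't reuse it for real traffic) that prints the chain the server presents. A well-configured server would show the leaf plus the Sectigo intermediate; this one shows a single subject, the leaf:

import java.security.SecureRandom;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class ShowSentChain {
    public static void main(String[] args) throws Exception {
        // Disable trust checks ONLY so the handshake completes and we can
        // see exactly what the server sent.
        TrustManager[] noChecks = { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        } };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, noChecks, new SecureRandom());

        try (SSLSocket s = (SSLSocket) ctx.getSocketFactory().createSocket("www.strack.de", 443)) {
            s.startHandshake();
            // Print the subject of each certificate the server presented
            for (Certificate c : s.getSession().getPeerCertificates()) {
                System.out.println(((X509Certificate) c).getSubjectX500Principal());
            }
        }
    }
}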

Omitting the required chain cert(s) is a common mistake by server admins who don't bother reading documentation; if they only test with a browser or two, they never notice, because browsers can frequently work around the missing chain cert(s), while most other software, including Java, either cannot or does not by default. It is unlikely to be a deliberate anti-scraping measure, since it is easily bypassed (see below), but it does suggest the server admin has no particular goal of or interest in supporting scraping.

Instead of ignoring all cert problems as suggested by Krystian, you can fix this by obtaining the chain cert -- e.g. by exporting it from Firefox, or by fetching the caIssuers link in the cert, http://crt.sectigo.com/SectigoRSADomainValidationSecureServerCA.crt (shown in the SSLLabs report, in the keytool -printcert decode, or if you run the openssl s_client output through openssl x509 -noout -text) -- and adding it to your truststore (by default the file $JREDIR/lib/security/cacerts, or jssecacerts, unless you change it with a sysprop or in code).
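
As a sketch of the in-code variant (the file name and alias below are placeholders for wherever you saved the downloaded intermediate; this assumes the default cacerts password "changeit" and a jsoup version recent enough to have sslSocketFactory): load the default truststore, add the intermediate to it, and hand the resulting socket factory to jsoup:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TrustChainFix {
    public static void main(String[] args) throws Exception {
        // Start from the JRE's default truststore so all normal roots keep working
        Path cacerts = Paths.get(System.getProperty("java.home"), "lib", "security", "cacerts");
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(cacerts)) {
            ks.load(in, "changeit".toCharArray()); // default cacerts password
        }

        // Add the intermediate cert you downloaded (file name and alias are examples)
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        try (InputStream in = Files.newInputStream(Paths.get("SectigoRSADomainValidationSecureServerCA.crt"))) {
            ks.setCertificateEntry("sectigo-intermediate", cf.generateCertificate(in));
        }

        // Build an SSLContext whose trust managers use that keystore
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);

        // Use it just for this jsoup connection; validation still happens normally
        Document doc = Jsoup.connect("https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7")
                .sslSocketFactory(ctx.getSocketFactory())
                .get();
        System.out.println(doc.title());
    }
}

Unlike the trust-everything approach, this keeps certificate validation intact; it just gives the path builder the link the server failed to send.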

(added) Re Firefox: you already found that the UI has changed slightly in the years since the Q you linked. You now click the padlock, then the right-arrow, then More Information, then View Certificate. To export a specific cert, click the tab for "Sectigo RSA ...", scroll about halfway down to the Miscellaneous section, then click "PEM (cert)" and save it somewhere appropriate.

You could also report this problem to the site admin or owners, though whether they will care about you, or about non-browser access in general, I have no idea.

Upvotes: 2

Krystian G

Reputation: 2941

If you're not sending any sensitive data, you can cheat a little and configure your own TrustManager to accept everything:

import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// A TrustManager that accepts every certificate without checking anything
TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
    public X509Certificate[] getAcceptedIssuers() {
        // An empty array satisfies the API contract (null also works in practice)
        return new X509Certificate[0];
    }

    public void checkClientTrusted(X509Certificate[] certs, String authType) {
        // accept everything
    }

    public void checkServerTrusted(X509Certificate[] certs, String authType) {
        // accept everything
    }
} };

SSLContext sc = SSLContext.getInstance("SSL"); // "TLS" is the preferred name nowadays
sc.init(null, trustAllCerts, new java.security.SecureRandom());
HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());

Now it will be used for EVERY connection, so it's not recommended. Instead, you can comment out the last line and make jsoup use it for only a single connection by specifying the SSLSocketFactory like this:

// HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory()); // not needed anymore
Document doc = Jsoup.connect(URL)
                    .sslSocketFactory(sc.getSocketFactory())
                    .get();

Upvotes: 3
