Parse HTTPS with jsoup (Java)

I am trying to parse a document with jsoup (Java). This is my Java code:

    package test;

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class crawler {
        private static final int TIMEOUT_IN_MS = 5000;

        public static void main(String[] args) throws MalformedURLException, IOException {
            Document doc = Jsoup.parse(new URL("http://www.internet.com/"), TIMEOUT_IN_MS);

            System.out.println(doc.html());
        }
    }

OK, this works. But when I try to parse an HTTPS site, I get this error message:

    Document doc = Jsoup.parse(new URL("https://www.somesite.com/"), TIMEOUT_IN_MS);

    System.out.println(doc.html());

    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.somesite.com/
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
        at org.jsoup.Jsoup.parse(Jsoup.java:183)
        at test.crawler.main(crawler.java:14)

I only get this error when I try to parse HTTPS URLs. HTTP works.

Upvotes: 0

Views: 2660

Answers (3)

green-creeper

Reputation: 316

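You could also just ignore SSL certificate validation, if that is required:

    Jsoup.connect("https://example.com").validateTLSCertificates(false).get();

Note that validateTLSCertificates was deprecated and later removed from jsoup, so this only works with older versions of the library.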

Upvotes: 0

HARDI

Reputation: 394

You may need to provide authentication when requesting the URL. If the request works in a browser but not from Java code, also try the solution in "403 Forbidden with Java but not web browser?".

Upvotes: 1

Jonathan Hedley

Reputation: 10522

jsoup supports HTTPS fine - it just uses Java's URLConnection under the hood.

A 403 server response indicates that the server has 'forbidden' the request, normally due to authorization issues. If you're getting an HTTP response status code at all, the TLS (HTTPS) negotiation has worked.

The issue here is probably not related to HTTPS; it's just that the URL you're having trouble fetching happens to be HTTPS. You need to understand why the server is giving you a 403. My guess is that either you need to send some authorization tokens (cookies or URL params), or it is blocking the request because of the user agent (which defaults to "Java" unless you specify it). Lots of services block requests that way. Try setting the user agent to a common browser string, using the Connection methods you get from Jsoup.connect.
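Since jsoup delegates to URLConnection, the effect of the user agent can be sketched with the JDK alone. The following is a minimal sketch, not the asker's code: the class name UserAgentSketch, the helper prepare, and the browser string are assumptions made for illustration. In jsoup itself the counterpart would be along the lines of Jsoup.connect(url).userAgent(...).timeout(...).get().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class UserAgentSketch {
    // Example browser-like string (an assumption); any common browser user agent would do.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    static final int TIMEOUT_IN_MS = 5000;

    // Builds a connection with an explicit User-Agent header.
    // openConnection() only creates the object; nothing is sent yet.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Without this, the JDK sends a "Java/<version>" user agent,
            // which some servers reject.
            conn.setRequestProperty("User-Agent", BROWSER_USER_AGENT);
            conn.setConnectTimeout(TIMEOUT_IN_MS);
            conn.setReadTimeout(TIMEOUT_IN_MS);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UserAgentSketch <url>");
            return;
        }
        HttpURLConnection conn = prepare(args[0]);
        // getResponseCode() sends the request. Any status code here,
        // including 403, means the TLS handshake already succeeded.
        System.out.println("Status: " + conn.getResponseCode());
    }
}
```

If the status is still 403 with a browser-like user agent, the server is more likely asking for cookies or other authorization, which this sketch does not cover.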

(People won't be able to help you more without real example URLs, because we can't tell what the server is doing just with this info.)

Upvotes: 1
