mohsenJsh

Reputation: 2108

How to fetch only URLs with an HTML content type using jsoup

I want to download only pages whose content type is "text/html", and skip files such as pdf, mp4, rar, and so on.

My current code is:

Connection connection = Jsoup.connect(linkInfo.getLink())
    .followRedirects(false)
    .validateTLSCertificates(false)
    .userAgent(USER_AGENT);

Document htmlDocument = connection.get();

if (!connection.response().contentType().contains("text/html")) {
    return;
}

Is there anything like this?

Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");

Upvotes: 0

Views: 104

Answers (1)

Leo Aso

Reputation: 12473

If you need to know whether a resource is HTML before actually downloading it, you can send a HEAD request. A HEAD request returns only the response headers, so you can check that the content type is text/html before fetching the body. Your current approach does not really work because connection.get() fetches and parses the response before your check runs, and jsoup throws an exception (UnsupportedMimeTypeException) for content types it cannot parse, such as pdf or mp4, so the check is never reached for those files.

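Connection connection = Jsoup.connect(linkInfo.getLink())
    .method(Connection.Method.HEAD)
    // Without this, jsoup throws on non-text/non-XML types instead of returning the response
    .ignoreContentType(true)
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT);

// HEAD fetches only the headers; contentType() is null when the header is absent
Connection.Response head = connection.execute();
String contentType = head.contentType();
if (contentType == null || !contentType.contains("text/html")) return;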

// Jsoup.connect expects a String, so convert the URL returned by head.url()
Document html = Jsoup.connect(head.url().toString())
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .get();
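
The content-type check itself can be made a little more robust. The sketch below is only an illustration, not part of jsoup: `ContentTypes` and `isHtml` are hypothetical names, and it assumes you want to accept exactly `text/html` (with any parameters such as `charset`) and treat a missing header as "not HTML". It uses only the JDK.

```java
import java.util.Locale;

/** Hypothetical helper (not part of jsoup): decides whether a Content-Type header value denotes HTML. */
final class ContentTypes {
    private ContentTypes() {}

    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: treat as "not HTML" (an assumption of this sketch)
        }
        // Drop parameters such as "; charset=UTF-8", then normalise whitespace and case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }
}
```

With a helper like this, the check after the HEAD request could be written as `if (!ContentTypes.isHtml(head.contentType())) return;`. Comparing the media type exactly, rather than with `contains`, avoids matching unrelated values that merely include the substring.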

Upvotes: 2
