Reputation: 2108
I want to download only pages whose content type is "text/html" and skip PDF, MP4, RAR and similar files.
For now my code is this:
Connection connection = Jsoup.connect(linkInfo.getLink())
    .followRedirects(false)
    .validateTLSCertificates(false)
    .userAgent(USER_AGENT);
Document htmlDocument = connection.get();
if (!connection.response().contentType().contains("text/html")) {
    return;
}
Isn't there anything like:
Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");
Upvotes: 0
Views: 104
Reputation: 12473
If you mean that you need a way to know whether a resource is HTML before actually downloading it, you can use a HEAD request. This requests just the headers, so you can check whether the content type is text/html before downloading the body. The approach in the question does not really work because get() fetches and parses the response before the check runs; for types that are neither text nor XML, jsoup throws an UnsupportedMimeTypeException (unless ignoreContentType(true) is set), so the check is never reached.
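// 1) HEAD request: fetches only the response headers, not the body.
Connection.Response head = Jsoup.connect(linkInfo.getLink())
    .method(Connection.Method.HEAD)
    .ignoreContentType(true) // report non-text types instead of throwing UnsupportedMimeTypeException
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .execute();
// contentType() returns null when the server sends no Content-Type header.
String contentType = head.contentType();
if (contentType == null || !contentType.contains("text/html")) return;
// 2) Only now download and parse the page. Jsoup.connect() takes a String, so convert the URL.
Document html = Jsoup.connect(head.url().toString())
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .get();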
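The contains("text/html") test above is a loose substring match. As a minimal sketch, assuming you want a stricter check, the header value could be normalised first: strip parameters such as "; charset=UTF-8", ignore case, and treat a missing header as "not HTML". The class ContentTypeFilter and its method isHtml are hypothetical names for illustration, not part of jsoup.

```java
import java.util.Locale;

/** Hypothetical helper (not part of jsoup): decides whether a Content-Type header value denotes HTML. */
final class ContentTypeFilter {
    private ContentTypeFilter() {}

    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type header: treat as not HTML
        }
        // Drop parameters such as "; charset=UTF-8", then normalise whitespace and case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }
}
```

With this in place, the check after the HEAD request could read if (!ContentTypeFilter.isHtml(head.contentType())) return;. If XHTML pages should also pass, "application/xhtml+xml" would have to be accepted as well; this sketch deliberately accepts only text/html.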
Upvotes: 2