Tano

Reputation: 1377

Java jsoup link ignore

I have the following code:

private static final Pattern FILE_FILTER = Pattern.compile(
        ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
                "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

private boolean isRelevant(String url) {
    if (url.length() < 1)  // Remove empty urls
        return false;
    else if (FILE_FILTER.matcher(url).matches()) {
        return false;
    }
    else
        return TLSpecific.isRelevant(url);
}

I use this code when parsing a web site to check whether its links match any of the declared patterns, but I don't know whether there is a way to do this directly through jsoup and optimize the code. For example, given a web page, how can I ignore all of them with jsoup?

Upvotes: 1

Views: 821

Answers (1)

Stephan

Reputation: 43023

how I can ignore all of them with jsoup?

Let's say we want every element whose href or src attribute does not end with a jpg or jpeg extension.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String filteredLinksCssQuery = "[href]:not([href~=(?i)\\.jpe?g$]), " + //
                               "[src]:not([src~=(?i)\\.jpe?g$])";

String html = "<a href='foo.jpg'>foo</a>" + //
              "<a href='bar.svg'>bar</a>" + //
              "<script src='baz.js'></script>";

Document doc = Jsoup.parse(html);

for(Element e: doc.select(filteredLinksCssQuery)) {
    System.out.println(e);
}

OUTPUT

<a href="bar.svg">bar</a>
<script src="baz.js"></script>

[href]                      /* Select any element having an href attribute... */
:not([href~=(?i)\.jpe?g$])  /* ... but exclude those matching the regex (?i)\.jpe?g$ */
,                           /* OR */
[src]                       /* Select any element having a src attribute... */
:not([src~=(?i)\.jpe?g$])   /* ... but exclude those matching the regex (?i)\.jpe?g$ */
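
To see what the regex part of the selector accepts and rejects, here is a small sketch that uses only java.util.regex. It assumes the `~=` operator tests the attribute value with "find" semantics (a match anywhere in the value), which is why the trailing `$` is needed to pin the extension to the end. The class name ExtensionRegexDemo and the method isJpeg are made up for illustration.

```java
import java.util.regex.Pattern;

class ExtensionRegexDemo {

    // Same regex as in the selector: case-insensitive ".jpg" or ".jpeg" at the end of the value.
    static final Pattern JPEG = Pattern.compile("(?i)\\.jpe?g$");

    // Assumption: mirrors a "find anywhere" check; the $ anchor restricts it to the end.
    static boolean isJpeg(String attributeValue) {
        return JPEG.matcher(attributeValue).find();
    }

    public static void main(String[] args) {
        String[] values = {"foo.jpg", "FOO.JPEG", "bar.svg", "baz.js", "foo.jpg.html"};
        for (String value : values) {
            System.out.println(value + " -> " + isJpeg(value));
        }
    }
}
```

Values for which isJpeg returns true are the ones the :not(...) part would exclude.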

You can add more extensions to the filter. Because this CSS query can quickly become unmaintainable, you may want to write some code that generates filteredLinksCssQuery automatically.
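
One possible way to generate the query is sketched below. The class CssQueryBuilder and the method buildQuery are hypothetical names, not part of jsoup; the sketch only builds the selector string, folding all extensions into a single regex alternation per attribute. It assumes each extension is already a valid regex fragment (such as jpe?g) and that your jsoup version accepts a parenthesized group inside the attribute regex, which is worth checking on a small sample first.

```java
import java.util.List;
import java.util.StringJoiner;

class CssQueryBuilder {

    // Builds "[attr]:not([attr~=(?i)\.(ext1|ext2)$])" for each attribute, joined by ", ".
    // Assumption: extensions are regex fragments, so "jpe?g" covers both jpg and jpeg.
    static String buildQuery(List<String> attributes, List<String> extensions) {
        String alternation = String.join("|", extensions);
        StringJoiner query = new StringJoiner(", ");
        for (String attr : attributes) {
            query.add("[" + attr + "]:not([" + attr + "~=(?i)\\.(" + alternation + ")$])");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String filteredLinksCssQuery =
                buildQuery(List.of("href", "src"), List.of("jpe?g", "png", "gif"));
        System.out.println(filteredLinksCssQuery);
    }
}
```

The resulting string can be passed to doc.select(...) in place of the hand-written query, and the extension list from the question's FILE_FILTER can be supplied as the second argument.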

Upvotes: 3
