Web Crawler using jsoup

Question

I am developing a web crawler, but I am stuck because it does not collect all the reachable links. Here is my code:

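import java.io.IOException;
import java.util.HashSet;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SNCrawler extends Thread {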

    Specific s;

    HashSet<String> hs = new HashSet<String>();
    public SNCrawler(Specific s)
    {
        this.s = s;
    }

    public void crawl(String url) throws IOException {

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a");

        for (Element link : links)
        {
            if(isSuitable(link.attr("href")) && !hs.contains(link.attr("abs:href")))
            {
                hs.add(link.attr("href"));
                crawl(link.attr("href"));

            }
        }

    }

    public boolean isSuitable(String site)
    {
        boolean myBool = false;
        if(site.startsWith("http://www.svensktnaringsliv.se/") && !SNFilter.matcher(site).matches())
            if(site.contains(".pdf")) {
                hs.add(site);
                myBool=true;
            }else{
                hs.add(site);
                myBool=true;
            }
        return myBool;

    }

    private static final Pattern SNFilter = Pattern.compile(".*((/staff/|medarbetare|play|/member_organizations/|/sme_committee/|rm=print|/contact/|/brussels-office/|/about-us|/newsletter/|/advantagesweden/|service=print|#)).*");

    @Override
    public void run()
    {
        try {
            crawl("http://www.svensktnaringsliv.se/english/");
            for(String myS : hs)
            {
                System.out.println(myS);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

When the program reaches this part of the website, it doesn't get any links from there. The same thing happens for this page, where I get only 2 or 3 links. I have looked at the code for many hours but can't figure out why I got stuck.

Stephan · Accepted Answer

when the program reaches this part of the website it doesn't get any links from there

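The crawl function should work with absolute URLs only. link.attr("href") returns the attribute exactly as it is written in the page, which may be a relative path; such a value fails the startsWith check in isSuitable, so the link is skipped. link.attr("abs:href") resolves the link against the page's base URI first. Try the function below instead: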

public void crawl(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a");

    for (Element link : links) {
        // abs:href resolves relative links against the page's base URI
        String foundUrl = link.attr("abs:href").toLowerCase();

        if( isSuitable(foundUrl) && ( !hs.contains(foundUrl) ) ) {
            hs.add(foundUrl);
            crawl(foundUrl);
        }
    }
}
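
One more thing may be worth checking alongside this fix. If isSuitable is kept as posted in the question, it adds every accepted URL to hs itself, so the following !hs.contains(foundUrl) test is always false and the recursive call is never reached. Below is a minimal sketch of one way around that, assuming a filter with no side effects and a queue instead of recursion. It uses only the standard library so it runs without a network: the class name CrawlSketch, the example.com URLs, the toAbsolute helper and the rawLinks function are all hypothetical stand-ins, and toAbsolute only approximates what "abs:href" does in jsoup.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {

    // Hypothetical stand-in for isSuitable: it only answers yes or no
    // and never modifies the set of visited URLs.
    public static boolean isSuitable(String url) {
        return url.startsWith("http://example.com/") && !url.contains("#");
    }

    // Resolves a raw href against the page it was found on (an approximation
    // of jsoup's "abs:href"). Returns "" when the href cannot be parsed.
    public static String toAbsolute(String pageUrl, String href) {
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    // Breadth-first crawl. rawLinks maps a page URL to the href values found
    // on that page; with jsoup it would fetch the page and read each link.
    public static Set<String> crawl(String start, Function<String, List<String>> rawLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);

        while (!queue.isEmpty()) {
            String page = queue.poll();
            for (String href : rawLinks.apply(page)) {
                String abs = toAbsolute(page, href);
                // Set.add returns false when the URL was already seen,
                // so each page is queued at most once.
                if (isSuitable(abs) && visited.add(abs)) {
                    queue.add(abs);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A made-up three-page site containing relative links and a cycle.
        Map<String, List<String>> site = Map.of(
            "http://example.com/english/",
                List.of("news/", "/english/", "http://other.org/"),
            "http://example.com/english/news/",
                List.of("item.pdf", "/english/about"),
            "http://example.com/english/about",
                List.of("news/"));

        Set<String> found = crawl("http://example.com/english/",
                url -> site.getOrDefault(url, List.of()));
        found.forEach(System.out::println);
    }
}
```

The queue keeps the call stack flat on sites with long link chains, and the single visited.add call does the duplicate check and the insert in one step. In the real crawler, the rawLinks function would be replaced by the Jsoup.connect(url).get() call and the link.attr("abs:href") lookup from the answer above, in which case toAbsolute is not needed.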
