Reputation: 1377
I am developing a web crawler but I got stuck, because I cannot get all the reachable links, here is my code:
public class SNCrawler extends Thread {
Specific s;
HashSet<String> hs = new HashSet<String>();
public SNCrawler(Specific s)
{
this.s = s;
}
public void crawl(String url) throws IOException {
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a");
for (Element link : links)
{
if(isSuitable(link.attr("href")) && !hs.contains(link.attr("abs:href")))
{
hs.add(link.attr("href"));
crawl(link.attr("href"));
}
}
}
public boolean isSuitable(String site)
{
boolean myBool = false;
if(site.startsWith("http://www.svensktnaringsliv.se/") && !SNFilter.matcher(site).matches())
if(site.contains(".pdf")) {
hs.add(site);
myBool=true;
}else{
hs.add(site);
myBool=true;
}
return myBool;
}
private static final Pattern SNFilter = Pattern.compile(".*((/staff/|medarbetare|play|/member_organizations/|/sme_committee/|rm=print|/contact/|/brussels-office/|/about-us|/newsletter/|/advantagesweden/|service=print|#)).*");
@Override
public void run()
{
try {
crawl("http://www.svensktnaringsliv.se/english/");
for(String myS : hs)
{
System.out.println(myS);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
when the program reaches this part of the website it doesn get any links from there, is the same things for this page, from there I get only 2 or 3 links, I have looked at the code for many hours but cant really figute it out why I got stuck
Upvotes: 1
Views: 3586
Reputation: 43083
when the program reaches this part of the website it doesn get any links from there
The crawl function should work with absolute urls only. Try the function below instead:
public void crawl(String url) throws IOException {
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a");
for (Element link : links) {
String foundUrl = link.attr("abs:href").toLowerCase();
if( isSuitable(foundUrl) && ( !hs.contains(foundUrl) ) ) {
hs.add(foundUrl);
crawl(foundUrl);
}
}
}
Upvotes: 1