Reputation: 784
I have a method which gets all URLs from a page (and can optionally check whether each one is valid), but it only works for one page. I want to check the whole website, so I need to make it recursive.
private static FirefoxDriver driver;

public static void main(String[] args) throws Exception {
    driver = new FirefoxDriver();
    driver.get("https://example.com/");
    List<WebElement> allURLs = findAllLinks(driver);
    report(allURLs);

    // here are my trials for recursion
    for (WebElement element : allURLs) {
        driver.get(element.getAttribute("href"));
        List<WebElement> allUrls = findAllLinks(driver);
        report(allUrls);
    }
}

public static List<WebElement> findAllLinks(WebDriver driver) {
    List<WebElement> elementList = driver.findElements(By.tagName("a"));
    elementList.addAll(driver.findElements(By.tagName("img")));
    List<WebElement> finalList = new ArrayList<>();
    for (WebElement element : elementList) {
        if (element.getAttribute("href") != null) {
            finalList.add(element);
        }
    }
    return finalList;
}

public static void report(List<WebElement> allURLs) throws Exception {
    for (WebElement element : allURLs) {
        System.out.println("URL: " + element.getAttribute("href") + " returned "
                + isLinkBroken(new URL(element.getAttribute("href"))));
    }
}
See comment "here are my trials for recursion". But it goes through the first page, then again through the first page and that's all.
Upvotes: 0
Views: 2734
Reputation: 2950
You're trying to write a web crawler. I am a big fan of code reuse, which is to say I always look around to see if my project has already been written before I spend the time writing it myself. And there are many versions of web crawlers out there. One written by Marilena Panagiotidou pops up early in a Google search. Leaving out the imports, her basic version looks like this.
public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }
}
Probably the most important thing to note here is how the recursion works. A recursive method is one that calls itself. Your example above is not recursion: you have a method findAllLinks that you call once on the first page, and then once for every link found on that page. Notice how Marilena's getPageLinks method calls itself once for every link it finds on a page at a given URL. In calling itself it creates a new stack frame, generates a new set of links from that page, and calls itself again once for every one of those links, and so on.
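To make that concrete, here is a rough, untested sketch of what your Selenium code could look like restructured as a recursion. The crawl method name and the visited set are illustrative additions, not part of your code; the hrefs are copied into strings first because navigating with driver.get() makes the WebElements from the previous page stale.

// Rough sketch only; assumes the usual org.openqa.selenium.* and java.util imports
// and the findAllLinks method from the question.
private static void crawl(WebDriver driver, String url, Set<String> visited) {
    // Set.add returns false if the URL was already seen, so each page is crawled once
    if (url == null || !visited.add(url)) {
        return;
    }
    driver.get(url);

    // Copy the hrefs out before navigating away: the WebElements become stale
    // as soon as driver.get() loads the next page
    List<String> hrefs = new ArrayList<>();
    for (WebElement element : findAllLinks(driver)) {
        hrefs.add(element.getAttribute("href"));
    }

    // The recursive step: the method calls itself once for every link on this page
    for (String href : hrefs) {
        crawl(driver, href, visited);
    }
}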
Another important thing to note about a recursive function is when it stops calling itself. In this case Marilena's recursive function keeps calling itself until it can't find any new links. If the page you are crawling links to pages outside its domain, this program could run for a very long time. And, incidentally, what probably happens in that case is where this website got its name: a StackOverflowError.
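If you want to bound the crawl, one common guard (not part of Marilena's code above) is to recurse only into links on the same host as the starting URL, roughly along these lines:

// Illustrative helper only: call getPageLinks(...) only when
// sameHost(startUrl, page.attr("abs:href")) returns true.
private boolean sameHost(String startUrl, String linkUrl) {
    try {
        return new java.net.URL(startUrl).getHost()
                .equalsIgnoreCase(new java.net.URL(linkUrl).getHost());
    } catch (java.net.MalformedURLException e) {
        return false; // skip anything that is not an absolute, well-formed URL
    }
}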
Upvotes: 1
Reputation: 166
Make sure you are not visiting the same URL twice. Add some kind of table where you store the URLs you have already visited. Since every page probably starts with a header that links back to the home page, you could otherwise end up visiting it over and over again, for example.
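For example, a plain HashSet works fine as that table; something along these lines (the names are just illustrative):

// Sketch of the visited-URL check, assuming java.util.Set and java.util.HashSet
private static final Set<String> visited = new HashSet<>();

private static boolean shouldVisit(String url) {
    // Set.add returns false if the URL is already in the set,
    // so each URL gets crawled at most once
    return url != null && visited.add(url);
}

Call it before driver.get(...) (or before fetching the page however you fetch it) and skip the URL when it returns false.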
Upvotes: 0