Reputation: 784
I have a method which gets all URLs from a page (and can optionally check whether each one is valid), but it only works for one page. I want to check the whole website, so I need to make it recursive.
private static FirefoxDriver driver;

public static void main(String[] args) throws Exception {
    driver = new FirefoxDriver();
    driver.get("https://example.com/");
    List<WebElement> allURLs = findAllLinks(driver);
    report(allURLs);

    // here are my trials for recursion
    for (WebElement element : allURLs) {
        driver.get(element.getAttribute("href"));
        List<WebElement> allUrls = findAllLinks(driver);
        report(allUrls);
    }
}

public static List<WebElement> findAllLinks(WebDriver driver) {
    List<WebElement> elementList = driver.findElements(By.tagName("a"));
    elementList.addAll(driver.findElements(By.tagName("img")));
    List<WebElement> finalList = new ArrayList<>();
    for (WebElement element : elementList) {
        if (element.getAttribute("href") != null) {
            finalList.add(element);
        }
    }
    return finalList;
}

public static void report(List<WebElement> allURLs) throws Exception {
    for (WebElement element : allURLs) {
        System.out.println("URL: " + element.getAttribute("href") + " returned "
                + isLinkBroken(new URL(element.getAttribute("href"))));
    }
}
See comment "here are my trials for recursion". But it goes through the first page, then again through the first page and that's all.
Upvotes: 0
Views: 2734
Reputation: 2950
You're trying to write a web crawler. I am a big fan of code reuse, which is to say I always look around to see if my project has already been written before I spend the time writing it myself. And there are many versions of web crawlers out there. One written by Marilena Panagiotidou pops up early in a Google search. Leaving out the imports, her basic version looks like this.
public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }
}
Probably the most important thing to note here is how the recursion works. A recursive method is one that calls itself. Your example above is not recursion: you have a method findAllLinks that you call once on the first page, and then once for every link found on that page. Notice how Marilena's getPageLinks method calls itself once for every link it finds on a page at a given URL. In calling itself it creates a new stack frame, generates a new set of links from that page, and calls itself again once for every one of those links, and so on.
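To make that concrete, here is a rough, untested sketch of what your Selenium code could look like restructured as a recursion. The crawl method name and the visited set are illustrative additions, not part of your code; the hrefs are copied into strings first because navigating with driver.get() makes the WebElements from the previous page stale.

// Rough sketch only; assumes the usual org.openqa.selenium.* and java.util imports
// and the findAllLinks method from the question.
private static void crawl(WebDriver driver, String url, Set<String> visited) {
    // Set.add returns false if the URL was already seen, so each page is crawled once
    if (url == null || !visited.add(url)) {
        return;
    }
    driver.get(url);

    // Copy the hrefs out before navigating away: the WebElements become stale
    // as soon as driver.get() loads the next page
    List<String> hrefs = new ArrayList<>();
    for (WebElement element : findAllLinks(driver)) {
        hrefs.add(element.getAttribute("href"));
    }

    // The recursive step: the method calls itself once for every link on this page
    for (String href : hrefs) {
        crawl(driver, href, visited);
    }
}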
Another important thing to note about a recursive function is when it stops calling itself. In this case Marilena's recursive function keeps calling itself until it can't find any new links. If the page you are crawling links to pages outside its domain, this program could run for a very long time. And, incidentally, what probably happens in that case is where this website got its name: a StackOverflowError.
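If you want to bound the crawl, one common guard (not part of Marilena's code above) is to recurse only into links on the same host as the starting URL, roughly along these lines:

// Illustrative helper only: call getPageLinks(...) only when
// sameHost(startUrl, page.attr("abs:href")) returns true.
private boolean sameHost(String startUrl, String linkUrl) {
    try {
        return new java.net.URL(startUrl).getHost()
                .equalsIgnoreCase(new java.net.URL(linkUrl).getHost());
    } catch (java.net.MalformedURLException e) {
        return false; // skip anything that is not an absolute, well-formed URL
    }
}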
Upvotes: 1
Reputation: 166
Make sure you are not visiting the same URL twice. Add some kind of table where you store the URLs you have already visited. Since every page probably starts with a header that links back to the home page, you could otherwise end up visiting it over and over again, for example.
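For example, a plain HashSet works fine as that table; something along these lines (the names are just illustrative):

// Sketch of the visited-URL check, assuming java.util.Set and java.util.HashSet
private static final Set<String> visited = new HashSet<>();

private static boolean shouldVisit(String url) {
    // Set.add returns false if the URL is already in the set,
    // so each URL gets crawled at most once
    return url != null && visited.add(url);
}

Call it before driver.get(...) (or before fetching the page however you fetch it) and skip the URL when it returns false.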
Upvotes: 0