kmaz13

Reputation: 260

Making Depth First Search continue after first pass?

I am trying to create a basic depth first search based web crawler. Here is my current code:

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.*;
import java.net.*;

public class DepthFirstSpider {
    private List<String> visitedList; //web pages already visited
    private static String hrefExpr = "href\\s*=\\s*\"([^\"]+)\"";
    private static Pattern pattern = Pattern.compile(hrefExpr);
    private int limit;
    private static Matcher matcher;
    private static URL contextURL;
    private static URL url;

    public List<String>  getVisitedList() { return visitedList; }

    //initialize the visitedlist and limit instance variables. Visit the starting url.
    public DepthFirstSpider(int limit, String startingURL) {
        visitedList = new ArrayList<String>();
        this.limit = limit;
        try {
            contextURL = new URL(startingURL);
        } catch (MalformedURLException e) {

        }

        visit(startingURL);
    }

    //print and add urlString to list of visited web pages 
    //create url and connect, read through html contents:
    //when href encountered create new url relative to the current url and visit it (if not already visited and limit not reached)
    public void visit(String urlString) {
        try{
            url = new URL(contextURL, urlString);
            URLConnection connection = url.openConnection();
            InputStream inputStream = connection.getInputStream();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(inputStream));
            String nextLine;
            while((nextLine=reader.readLine()) != null){
                matcher = pattern.matcher(nextLine);
                while(matcher.find() && limit > 0 && !visitedList.contains(url.toString())){
                    System.out.println("visiting " + url.toString());
                    visitedList.add(url.toString());
                    visit(matcher.group(1));
                    limit--;
                }
            }
        } catch (MalformedURLException e){

        } catch (IOException e){

        }
    }

}

The search currently shoots straight down one branch of the tree of web pages without a problem. I need help making it back up and then visit the pages it missed. Thanks for the help.

Upvotes: 0

Views: 499

Answers (2)

ehanoc

Reputation: 2217

I might be missing something, but in depth-first search you need to keep track of the expanded nodes as well. Each time you generate child nodes, you should add them to a stack (FILO).

You should push() every expanded node onto the stack and pop() one off at each iteration. When you reach the limit, you will be popping the upper nodes again.

Is this homework?

You can find a decent pseudo-code explanation on Wikipedia.
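
For what it's worth, here is a minimal sketch of that idea in Java. The class name and the extractLinks helper are made up; extractLinks just stands in for the URLConnection/regex code in the question.

import java.util.*;

public class IterativeDfsSketch {

    // Hypothetical helper: returns the links found on the given page.
    // In the question's code this would wrap the URLConnection/regex logic.
    static List<String> extractLinks(String url) {
        return Collections.emptyList(); // placeholder
    }

    public static List<String> crawl(String startingUrl, int limit) {
        List<String> visited = new ArrayList<String>();
        Deque<String> stack = new ArrayDeque<String>(); // explicit DFS stack
        stack.push(startingUrl);

        while (!stack.isEmpty() && visited.size() < limit) {
            String current = stack.pop();   // expand the most recently pushed page
            if (visited.contains(current)) {
                continue;                   // already expanded, skip it
            }
            System.out.println("visiting " + current);
            visited.add(current);

            // Push every child link; the siblings stay on the stack,
            // so the crawl backs up to them once a branch is exhausted.
            for (String link : extractLinks(current)) {
                if (!visited.contains(link)) {
                    stack.push(link);
                }
            }
        }
        return visited;
    }
}

Because the unvisited siblings stay on the stack, the crawl naturally "goes back up" and continues with the pages the recursive version missed.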

Upvotes: 1

ulu5

Reputation: 437

When I wrote a crawler, I used two queues instead of just one list: one held the URLs still to visit and the other held the URLs already visited. I added every URL I wanted to visit to the toVisit queue. As I visited each URL, I removed it from the toVisit queue (and added it to the visited queue), then added all links on that page to the toVisit queue unless they were already in the visited queue. There is no need to traverse recursively when you do it this way; a sketch of the idea follows below.
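
Here is a rough sketch of that two-queue approach in Java. The class name and the extractLinks helper are hypothetical, and the visited collection is a Set rather than a literal queue so that the membership check is cheap.

import java.util.*;

public class TwoQueueCrawlerSketch {

    // Hypothetical helper standing in for the actual page-fetching/parsing code.
    static List<String> extractLinks(String url) {
        return Collections.emptyList(); // placeholder
    }

    public static Collection<String> crawl(String startingUrl, int limit) {
        Queue<String> toVisit = new LinkedList<String>();   // urls still to visit
        Set<String> visited = new LinkedHashSet<String>();  // urls already visited
        toVisit.add(startingUrl);

        while (!toVisit.isEmpty() && visited.size() < limit) {
            String current = toVisit.remove();  // take the next url off the toVisit queue
            if (visited.contains(current)) {
                continue;                       // already crawled via another page
            }
            visited.add(current);               // move it to the visited collection

            // Add every link on the page unless it has already been visited.
            for (String link : extractLinks(current)) {
                if (!visited.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}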

Upvotes: 1
