kmaz13

Reputation: 260

Making Depth First Search continue after first pass?

I am trying to create a basic depth first search based web crawler. Here is my current code:

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.*;
import java.net.*;

public class DepthFirstSpider {
    private List<String> visitedList; //web pages already visited
    private static String hrefExpr = "href\\s*=\\s*\"([^\"]+)\"";
    private static Pattern pattern = Pattern.compile(hrefExpr);
    private int limit;
    private static Matcher matcher;
    private static URL contextURL;
    private static URL url;

    public List<String>  getVisitedList() { return visitedList; }

    //initialize the visitedlist and limit instance variables. Visit the starting url.
    public DepthFirstSpider(int limit, String startingURL) {
        visitedList = new ArrayList<String>();
        this.limit = limit;
        try {
            contextURL = new URL(startingURL);
        } catch (MalformedURLException e) {

        }

        visit(startingURL);
    }

    //print and add urlString to list of visited web pages 
    //create url and connect, read through html contents:
    //when href encountered create new url relative to the current url and visit it (if not already visited and limit not reached)
    public void visit(String urlString) {
        try{
            url = new URL(contextURL, urlString);
            URLConnection connection = url.openConnection();
            InputStream inputStream = connection.getInputStream();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(inputStream));
            String nextLine;
            while((nextLine=reader.readLine()) != null){
                matcher = pattern.matcher(nextLine);
                while(matcher.find() && limit > 0 && !visitedList.contains(url.toString())){
                    System.out.println("visiting " + url.toString());
                    visitedList.add(url.toString());
                    visit(matcher.group(1));
                    limit--;
                }
            }
        } catch (MalformedURLException e){

        } catch (IOException e){

        }
    }

}

The search currently shoots straight down one branch of the tree of web pages without a problem. I need help making it back up and then visit the pages it missed. Thanks for the help.

Upvotes: 0

Views: 499

Answers (2)

ehanoc

Reputation: 2217

I might be missing something, but in depth-first search you need to keep track of the expanded nodes as well. Each time you generate child nodes, you should add them to a stack (FILO).

You should push() every expanded node onto the stack and pop() one off at each iteration. When you reach the limit, you will be popping the upper nodes again.

Is this homework?

You can find a decent pseudo-code explanation on Wikipedia.
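
For what it's worth, here is a minimal sketch of that idea in Java. The class name and the extractLinks helper are made up; extractLinks just stands in for the URLConnection/regex code in the question.

import java.util.*;

public class IterativeDfsSketch {

    // Hypothetical helper: returns the links found on the given page.
    // In the question's code this would wrap the URLConnection/regex logic.
    static List<String> extractLinks(String url) {
        return Collections.emptyList(); // placeholder
    }

    public static List<String> crawl(String startingUrl, int limit) {
        List<String> visited = new ArrayList<String>();
        Deque<String> stack = new ArrayDeque<String>(); // explicit DFS stack
        stack.push(startingUrl);

        while (!stack.isEmpty() && visited.size() < limit) {
            String current = stack.pop();   // expand the most recently pushed page
            if (visited.contains(current)) {
                continue;                   // already expanded, skip it
            }
            System.out.println("visiting " + current);
            visited.add(current);

            // Push every child link; the siblings stay on the stack,
            // so the crawl backs up to them once a branch is exhausted.
            for (String link : extractLinks(current)) {
                if (!visited.contains(link)) {
                    stack.push(link);
                }
            }
        }
        return visited;
    }
}

Because the unvisited siblings stay on the stack, the crawl naturally "goes back up" and continues with the pages the recursive version missed.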

Upvotes: 1

ulu5

Reputation: 437

When I wrote a crawler, I used two queues instead of just one list: one held the URLs still to visit and the other held the URLs already visited. I added every URL I wanted to visit to the toVisit queue. As I visited each URL, I removed it from the toVisit queue (and added it to the visited queue), then added all links on that page to the toVisit queue unless they were already in the visited queue. There is no need to traverse recursively when you do it this way; a sketch of the idea follows below.
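
Here is a rough sketch of that two-queue approach in Java. The class name and the extractLinks helper are hypothetical, and the visited collection is a Set rather than a literal queue so that the membership check is cheap.

import java.util.*;

public class TwoQueueCrawlerSketch {

    // Hypothetical helper standing in for the actual page-fetching/parsing code.
    static List<String> extractLinks(String url) {
        return Collections.emptyList(); // placeholder
    }

    public static Collection<String> crawl(String startingUrl, int limit) {
        Queue<String> toVisit = new LinkedList<String>();   // urls still to visit
        Set<String> visited = new LinkedHashSet<String>();  // urls already visited
        toVisit.add(startingUrl);

        while (!toVisit.isEmpty() && visited.size() < limit) {
            String current = toVisit.remove();  // take the next url off the toVisit queue
            if (visited.contains(current)) {
                continue;                       // already crawled via another page
            }
            visited.add(current);               // move it to the visited collection

            // Add every link on the page unless it has already been visited.
            for (String link : extractLinks(current)) {
                if (!visited.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}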

Upvotes: 1
