Belgarion

Reputation: 159

Recursively retrieving links from web page in Java

I'm working on a simplified website downloader (Programming Assignment) and I have to recursively go through the links in the given url and download the individual pages to my local directory.

I already have a function to retrieve all the hyperlinks (href attributes) from a single page, Set<String> retrieveLinksOnPage(URL url), which returns them as a set. I have been told to download pages up to level 4 (level 0 being the home page). I therefore basically want to retrieve all the links in the site, but I'm having difficulty coming up with the recursion algorithm. In the end, I intend to call my function like this:

retrieveAllLinksFromSite("http://www.example.com/ldsjf.html",0)

Set<String> Links = new HashSet<String>();

Set<String> retrieveAllLinksFromSite (URL url, int Level, Set<String> Links)
{
    if (Level == 4)
       return Links;
    else {

        //retrieveLinksOnPage(url);
        //I'm pretty lost actually!
        }

}

Thanks!

Upvotes: 1

Views: 2301

Answers (2)

gigadot

Reputation: 8969

Here is the pseudo code:

Set<String> retrieveAllLinksFromSite(int Level, Set<String> Links) {
    if (Level < 5) {
        Set<String> local_links = new HashSet<String>();
        for (String link : Links) {
            // download the page at link
            Set<String> new_links = ...; // parse the downloaded html of link for hrefs
            local_links.addAll(retrieveAllLinksFromSite(Level + 1, new_links));
        }
        return local_links;
    } else {
        return Links;
    }
}

You will need to implement the things in the comments yourself. To run the function from a single given link, create an initial set of links which contains only that one link. However, it also works if you have multiple initial links.

Set<String> initial_link_set = new HashSet<String>();
initial_link_set.add("http://abc.com/");
Set<String> final_link_set = retrieveAllLinksFromSite(1, initial_link_set);
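To see the recursion in action without real network access, here is a compilable sketch of the pseudo code above. The PAGES map is a made-up stand-in for the download-and-parse step (in the assignment, retrieveLinksOnPage plus an HTTP download would replace it), and one tweak is made: the local set is seeded with the incoming links so the crawled URLs themselves survive into the result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Crawler {

    // Stand-in for the web: maps a URL to the links found on that page.
    // In the real assignment this would be an HTTP download plus HTML parsing.
    static final Map<String, Set<String>> PAGES = new HashMap<String, Set<String>>();
    static {
        PAGES.put("http://abc.com/",
                new HashSet<String>(Arrays.asList("http://abc.com/a", "http://abc.com/b")));
        PAGES.put("http://abc.com/a",
                new HashSet<String>(Arrays.asList("http://abc.com/c")));
    }

    // Stand-in for retrieveLinksOnPage: returns the links on one page.
    static Set<String> retrieveLinksOnPage(String url) {
        Set<String> links = PAGES.get(url);
        return links != null ? links : new HashSet<String>();
    }

    static Set<String> retrieveAllLinksFromSite(int level, Set<String> links) {
        if (level < 5) {
            // seed with the incoming links so they appear in the result
            Set<String> localLinks = new HashSet<String>(links);
            for (String link : links) {
                Set<String> newLinks = retrieveLinksOnPage(link); // download + parse
                localLinks.addAll(retrieveAllLinksFromSite(level + 1, newLinks));
            }
            return localLinks;
        } else {
            return links;
        }
    }

    public static void main(String[] args) {
        Set<String> initial = new HashSet<String>();
        initial.add("http://abc.com/");
        Set<String> all = retrieveAllLinksFromSite(1, initial);
        System.out.println(all.size()); // 4 URLs in this toy graph
    }
}
```

Note that, as written, the function would revisit pages that link back to each other; on a real site you would also keep a set of already-visited URLs.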

Upvotes: 3

Rndm

Reputation: 6860

You can use a HashMap instead of a Vector to store the links together with their levels (since you need to recursively get all links down to level 4).

Also, it would be something like this (just giving an overall hint):

HashMap<String, Integer> Links = new HashMap<String, Integer>();

void retrieveAllLinksFromSite (URL url, int Level)
{
    if (Level == 4)
       return;
    else {
        retrieve the links on the current page, and for each retrieved link
        do {
           download the link
           Links.put(the retrieved url, Level);  // store the link with its level in the hashmap
           retrieveAllLinksFromSite(the retrieved url, Level + 1);  // recursively call for further levels
        }
    }
}
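As a concrete (hypothetical) version of this hint: in the sketch below the PAGES map stands in for the actual download-and-parse step, links found on a level-n page are recorded at level n+1, and a containsKey check is added so pages that link back to each other do not recurse forever.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LevelCrawler {

    // Toy page graph standing in for real download + HTML parsing.
    static final Map<String, Set<String>> PAGES = new HashMap<String, Set<String>>();
    static {
        PAGES.put("home", new HashSet<String>(Arrays.asList("a", "b")));
        PAGES.put("a", new HashSet<String>(Arrays.asList("c")));
    }

    // URL -> level at which it was first found.
    static final Map<String, Integer> LINKS = new HashMap<String, Integer>();

    static void retrieveAllLinksFromSite(String url, int level) {
        if (level == 4)
            return;
        Set<String> onPage = PAGES.containsKey(url) ? PAGES.get(url) : new HashSet<String>();
        for (String link : onPage) {
            if (!LINKS.containsKey(link)) {          // skip already-seen links to avoid loops
                LINKS.put(link, level + 1);          // store the link with its level
                retrieveAllLinksFromSite(link, level + 1); // recurse into further levels
            }
        }
    }

    public static void main(String[] args) {
        retrieveAllLinksFromSite("home", 0);         // level 0 is the home page
        System.out.println(LINKS);                   // a and b at level 1, c at level 2
    }
}
```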

Upvotes: 0
