Reputation: 29
I am working on a web crawler that is supposed to download all images, files, and web pages, and then recursively do the same for all web pages it finds. However, I seem to have a logic error.
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;

public class WebCrawler {

    private static String url;
    private static int maxCrawlDepth;
    private static String filePath;

    /* Recursive function that crawls all web pages found on a given web page.
     * This function also saves elements from the DownloadRepository to disk.
     */
    public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        webpage.crawl(currentCrawlDepth);

        HashMap<String, WebPage> pages = webpage.getCrawledWebPages();

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : pages.values()) {
                crawling(wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("Must pass three parameters");
            System.exit(0);
        }

        // Validate that the first argument is a reachable URL.
        url = args[0];
        try {
            URL testUrl = new URL(url);
            URLConnection urlConnection = testUrl.openConnection();
            urlConnection.connect();
        } catch (MalformedURLException e) {
            System.out.println("Not a valid URL");
            System.exit(0);
        } catch (IOException e) {
            System.out.println("Could not open URL");
            System.exit(0);
        }

        // Validate that the second argument is an integer crawl depth.
        try {
            maxCrawlDepth = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Argument is not an int");
            System.exit(0);
        }

        // Validate that the third argument is an existing path.
        filePath = args[2];
        File path = new File(filePath);
        if (!path.exists()) {
            System.out.println("File Path is invalid");
            System.exit(0);
        }

        WebPage webpage = new WebPage(url);
        crawling(webpage, 0, maxCrawlDepth);

        System.out.println("Web crawl is complete");
    }
}
The crawl function parses the contents of a web page, storing any images, files, or links it finds into a HashMap. For example:
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.HashMap;

import org.jsoup.Jsoup;
import org.jsoup.helper.HttpConnection;
import org.jsoup.helper.HttpConnection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebPage implements WebElement {

    private static Elements images;
    private static Elements links;

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebFile> files = new HashMap<String, WebFile>();

    private String url;

    public WebPage(String url) {
        this.url = url;
    }

    /* The crawl method parses the html on a given web page
     * and adds the elements of the web page to the Download
     * Repository.
     */
    public void crawl(int currentCrawlDepth) {
        System.out.print("Crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");

        Document doc = null;
        try {
            // Fetch the page, ignoring the content type so jsoup does
            // not reject non-HTML responses outright.
            HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
            httpConnection.ignoreContentType(true);
            doc = httpConnection.get();
        } catch (MalformedURLException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IOException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IllegalArgumentException e) {
            System.out.println(url + " is not a valid URL");
        }

        DownloadRepository downloadRepository = DownloadRepository.getInstance();

        if (doc != null) {
            images = doc.select("img");
            links = doc.select("a[href]");

            // Register every image found on the page.
            for (Element image : images) {
                String imageUrl = image.absUrl("src");
                if (!webImages.containsValue(image)) {
                    WebImage webImage = new WebImage(imageUrl);
                    webImages.put(imageUrl, webImage);
                    downloadRepository.addElement(imageUrl, webImage);
                    System.out.println("Added image at " + imageUrl);
                }
            }

            HttpConnection mimeConnection = null;
            Response mimeResponse = null;

            // Check each link's MIME type to decide whether it is a
            // web page or a downloadable file.
            for (Element link : links) {
                String linkUrl = link.absUrl("href").trim();

                if (!linkUrl.contains("#")) {
                    try {
                        mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                        mimeConnection.ignoreContentType(true);
                        mimeConnection.ignoreHttpErrors(true);
                        mimeResponse = (Response) mimeConnection.execute();
                    } catch (Exception e) {
                        System.out.println(e.getLocalizedMessage());
                    }

                    String contentType = null;
                    if (mimeResponse != null) {
                        contentType = mimeResponse.contentType();
                    }
                    if (contentType == null) {
                        continue;
                    }

                    if (contentType.equals("text/html")) {
                        if (!webPages.containsKey(linkUrl)) {
                            WebPage webPage = new WebPage(linkUrl);
                            webPages.put(linkUrl, webPage);
                            downloadRepository.addElement(linkUrl, webPage);
                            System.out.println("Added webPage at " + linkUrl);
                        }
                    } else {
                        if (!files.containsValue(link)) {
                            WebFile webFile = new WebFile(linkUrl);
                            files.put(linkUrl, webFile);
                            downloadRepository.addElement(linkUrl, webFile);
                            System.out.println("Added file at " + linkUrl);
                        }
                    }
                }
            }
        }

        System.out.print("\nFinished crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");
    }

    public HashMap<String, WebImage> getImages() {
        return webImages;
    }

    public HashMap<String, WebPage> getCrawledWebPages() {
        return webPages;
    }

    public HashMap<String, WebFile> getFiles() {
        return files;
    }

    public String getUrl() {
        return url;
    }

    @Override
    public void saveToDisk(String filePath) {
        System.out.println(filePath);
    }
}
The point of using a HashMap is to ensure that I do not parse the same web page more than once. The error seems to be in my recursion. What is the issue?
Here is some sample output from starting the crawl at http://www.google.com:
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses https://www.google.com/intl/en/policies/ twice.
Upvotes: 0
Views: 3993
Reputation: 65811
You are creating a new map for each web page. This ensures that if the same link occurs twice on a single page it is only crawled once, but it does not handle the case where the same link appears on two different pages. https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this, use a single map for the whole crawl and pass it as a parameter into the recursion:
public class WebCrawler {

    public static void crawling(Map<String, WebPage> visited, WebPage webpage,
            int currentCrawlDepth, int maxCrawlDepth) {
        // consult the shared visited map before recursing into each page
    }
}
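Filled in, the recursion might look something like this. This is only a sketch: it reuses your existing crawl, getCrawledWebPages, and getUrl methods, and assumes the argument validation from your current main still runs before the first call.

import java.util.HashMap;
import java.util.Map;

public class WebCrawler {

    private static String url;
    private static int maxCrawlDepth;

    public static void crawling(Map<String, WebPage> visited, WebPage webpage,
            int currentCrawlDepth, int maxCrawlDepth) {
        webpage.crawl(currentCrawlDepth);

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : webpage.getCrawledWebPages().values()) {
                // Recurse only into pages that no part of the crawl has seen yet,
                // regardless of which page they were found on.
                if (!visited.containsKey(wp.getUrl())) {
                    visited.put(wp.getUrl(), wp);
                    crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
                }
            }
        }
    }

    public static void main(String[] args) {
        // ... argument validation as in your current main ...
        Map<String, WebPage> visited = new HashMap<String, WebPage>();
        WebPage webpage = new WebPage(url);
        visited.put(url, webpage); // the start page counts as visited
        crawling(visited, webpage, 0, maxCrawlDepth);
    }
}

Because there is exactly one map for the entire crawl, a URL discovered at depth 0 can never be crawled again at depth 2.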
As you are also holding maps of the images and files, you may prefer to create a new object, perhaps called Visited, and make it keep track of everything:
import java.util.HashMap;

public class Visited {

    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) {
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}
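To illustrate the intended behaviour, here is a quick usage sketch (VisitedDemo is a hypothetical class, not part of your code):

public class VisitedDemo {
    public static void main(String[] args) {
        Visited visited = new Visited();
        WebPage page = new WebPage("https://www.google.com/intl/en/policies/");

        System.out.println(visited.visit(page.getUrl(), page)); // true: first time seen
        System.out.println(visited.visit(page.getUrl(), page)); // false: already visited
    }
}

Inside WebPage.crawl you would then call visit(linkUrl, webPage) and only add the element to the DownloadRepository when it returns true.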
Upvotes: 1