user1025852

Reputation: 2784

crawler4j crawls only seed URLs

Why does the following code, built on crawler4j, crawl only the given seed URL and not start to follow any other links?

public static void main(String[] args)
{
    String crawlStorageFolder = "F:\\crawl";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(4);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = null;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        e.printStackTrace();
    }

    /*
     * For each crawl, you need to add some seed URLs. These are the first
     * URLs that are fetched; the crawler then starts following links
     * found in those pages.
     */
    controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
}

Upvotes: 1

Views: 684

Answers (1)

rzo1

Reputation: 5751

The official example crawler is limited to the www.ics.uci.edu domain, so the shouldVisit method in the class extending WebCrawler needs to be adapted to accept URLs from other domains. This is the original check from the example:

/**
 * You should implement this function to specify whether the given URL
 * should be crawled or not (based on your crawling logic).
 */
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the URL if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
        return false;
    }

    // Only accept the URL if it is in the "www.ics.uci.edu" domain and the protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
}
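
Since the seed in the question points to edition.cnn.com, the domain check has to be widened accordingly. A minimal sketch of an adapted shouldVisit, assuming you want to stay within the CNN domain (the IMAGE_EXTENSIONS pattern is the one from the example crawler; the domain prefixes are just illustrative):

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Still skip image URLs (pattern taken from the crawler4j example crawler).
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
        return false;
    }
    // Accept URLs on the CNN domain instead of www.ics.uci.edu.
    // Loosen or drop this check if you want the crawler to leave the domain.
    return href.startsWith("http://edition.cnn.com/")
        || href.startsWith("https://edition.cnn.com/");
}

If shouldVisit returns false for every link found on the seed page (as the original www.ics.uci.edu check does for CNN links), the crawl stops after the seeds, which is exactly the behaviour described in the question.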

Upvotes: 3
