Juan Andrés Diana

Reputation: 2225

Efficient way to GET multiple HTML pages simultaneously

So I'm working on web scraping for a certain website. The problem is:

Given a set of URLs (on the order of 100s to 1000s), I would like to retrieve the HTML of each URL in an efficient manner, especially time-wise. I need to be able to do 1000s of requests every 5 minutes.

This should usually imply using a pool of threads that pull requests from a set of not-yet-requested URLs, roughly along the lines of the sketch below. But before jumping into implementing this, I believe it's worth asking here, since this seems like a fairly common problem when doing web scraping or web crawling.
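Roughly what I have in mind, as a plain-JDK sketch (the pool size of 20, the URLs, and the helper method are just placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelFetcher {

        public static void main(String[] args) throws Exception {
            List<String> urls = Arrays.asList(
                    "http://www.google.com",
                    "http://www.wikipedia.org");

            // Fixed pool of worker threads; each worker downloads one URL at a time
            ExecutorService pool = Executors.newFixedThreadPool(20);
            Map<String, Future<String>> results = new HashMap<String, Future<String>>();

            for (final String url : urls) {
                results.put(url, pool.submit(new Callable<String>() {
                    @Override
                    public String call() throws Exception {
                        return fetch(url);
                    }
                }));
            }

            for (Map.Entry<String, Future<String>> e : results.entrySet()) {
                // get() blocks until that particular download has finished
                System.out.println(e.getKey() + " -> " + e.getValue().get().length() + " chars");
            }
            pool.shutdown();
        }

        // Naive HTML download; no redirects, charsets, or retries handled yet
        static String fetch(String url) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }
    }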

Is there any library that has what I need?

Upvotes: 2

Views: 1699

Answers (4)

ehsun7b

Reputation: 4866

Use http://www.jsoup.org for the scraping part! The thread pooling I think you should implement yourself.
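A minimal Jsoup call looks roughly like this (the URL and the selector are only examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // Fetches the page and parses it into a DOM in one call
            Document doc = Jsoup.connect("http://www.wikipedia.org").get();
            System.out.println(doc.title());

            // For example, print every absolute link on the page
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }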

Update

If this approach fits your needs, you can download the complete class files here: http://codetoearn.blogspot.com/2013/01/concurrent-web-requests-with-thread.html

    AsyncWebReader webReader = new AsyncWebReader(5/*number of threads*/, new String[]{
              "http://www.google.com",
              "http://www.yahoo.com",
              "http://www.live.com",
              "http://www.wikipedia.com",
              "http://www.facebook.com",
              "http://www.khorasannews.com",
              "http://www.fcbarcelona.com",
              "http://www.khorasannews.com",
            });

    webReader.addObserver(new Observer() {
      @Override
      public void update(Observable o, Object arg) {
        if (arg instanceof Exception) {
          Exception ex = (Exception) arg;
          System.out.println(ex.getMessage());
        } /*else if (arg instanceof List) {
          List vals = (List) arg;
          System.out.println(vals.get(0) + ": " + vals.get(1));
        } */else if (arg instanceof Object[]) {
          Object[] objects = (Object[]) arg;
          HashMap result = (HashMap) objects[0];
          String[] success = (String[]) objects[1];
          String[] fail = (String[]) objects[2];

          System.out.println("Failds");
          for (int i = 0; i < fail.length; i++) {
            String string = fail[i];
            System.out.println(string);
          }

          System.out.println("-----------");
          System.out.println("success");
          for (int i = 0; i < success.length; i++) {
            String string = success[i];
            System.out.println(string);
          }

          System.out.println("\n\nresult of Google: ");
          System.out.println(result.remove("http://www.google.com"));
        }
      }
    });
    Thread t = new Thread(webReader);
    t.start();
    t.join();

Upvotes: 0

Miserable Variable

Reputation: 28752

So I'm working on web scraping for a certain website.

Are you scraping a single server, or are you scraping from multiple other hosts? If it is the former, the server you are scraping may not like too many concurrent connections from a single IP.

If it is the latter, this is really a general question about how many outbound connections you should open from one machine. There is a physical limit, but it is pretty large. In practice, it depends on where that client is deployed: the better the connectivity, the higher the number of connections it can accommodate.

You might want to look at the source code of a good download manager to see if they have a limit on the number of outbound connections.

Definitely use asynchronous I/O, but you would still do well to limit the number of connections.
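One simple way to cap the number of simultaneous outbound connections is a Semaphore around the fetch; a rough sketch (the limit of 10 and the placeholder output are arbitrary):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ThrottledFetcher {
        // Arbitrary cap on simultaneous outbound connections
        private static final Semaphore connectionPermits = new Semaphore(10);

        static void fetchWithLimit(final String url, ExecutorService pool) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        connectionPermits.acquire();
                        try {
                            // a real download(url) would go here; this just stands in for it
                            System.out.println("fetching " + url);
                        } finally {
                            connectionPermits.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newCachedThreadPool();
            for (int i = 0; i < 50; i++) {
                fetchWithLimit("http://example.com/page" + i, pool);
            }
            pool.shutdown();
        }
    }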

Upvotes: 1

syb0rg

Reputation: 8247

I would start by researching asynchronous communication. Then take a look at Netty.

Keep in mind there is always a limit to how fast one can load a web page. For an average home connection, it will be around a second. Take this into consideration when programming your application.
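To illustrate the asynchronous, callback-driven style (using Apache's HttpAsyncClient here purely as an example; Netty works at a lower level, but the idea is the same):

    import java.util.concurrent.CountDownLatch;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.concurrent.FutureCallback;
    import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
    import org.apache.http.impl.nio.client.HttpAsyncClients;

    public class AsyncFetchExample {
        public static void main(String[] args) throws Exception {
            String[] urls = { "http://www.google.com", "http://www.wikipedia.org" };
            final CountDownLatch latch = new CountDownLatch(urls.length);

            CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
            client.start();
            try {
                for (String url : urls) {
                    final HttpGet request = new HttpGet(url);
                    // execute() returns immediately; the callback fires when the response arrives
                    client.execute(request, new FutureCallback<HttpResponse>() {
                        @Override
                        public void completed(HttpResponse response) {
                            System.out.println(request.getURI() + " -> " + response.getStatusLine());
                            latch.countDown();
                        }
                        @Override
                        public void failed(Exception ex) {
                            System.out.println(request.getURI() + " failed: " + ex);
                            latch.countDown();
                        }
                        @Override
                        public void cancelled() {
                            latch.countDown();
                        }
                    });
                }
                latch.await(); // wait until every request has completed or failed
            } finally {
                client.close();
            }
        }
    }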

Upvotes: 0

Eric J.

Reputation: 150108

Your bandwidth utilization will be the sum of all of the HTML documents that you retrieve (plus a little overhead) no matter how you slice it (though some web servers may support compressed HTTP streams, so certainly use a client capable of accepting them).
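For example, with a plain HttpURLConnection you can ask for a gzip-compressed response and decompress it yourself (a rough sketch; the URL is just an example):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class GzipFetch {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://www.wikipedia.org").openConnection();
            // Tell the server we accept a compressed response
            conn.setRequestProperty("Accept-Encoding", "gzip");

            InputStream body = conn.getInputStream();
            if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                body = new GZIPInputStream(body); // decompress on the fly
            }

            BufferedReader in = new BufferedReader(new InputStreamReader(body, "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }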

The optimal number of concurrent threads depends a great deal on your network connectivity to the sites in question. Only experimentation can find an optimal number. You can certainly use one set of threads for retrieving HTML documents and a separate set of threads to process them to make it easier to find the right balance.
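A sketch of that split, with fetcher threads handing pages to a parser thread through a queue (the thread counts and the placeholder HTML are illustrative only):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class DownloadParsePipeline {
        public static void main(String[] args) throws InterruptedException {
            final List<String> urls = Arrays.asList(
                    "http://www.google.com", "http://www.wikipedia.org");

            // Fetcher threads put raw HTML here; the parser thread takes it off
            final BlockingQueue<String> pages = new ArrayBlockingQueue<String>(100);

            ExecutorService fetchers = Executors.newFixedThreadPool(8); // network-bound
            ExecutorService parsers = Executors.newFixedThreadPool(2);  // CPU-bound

            for (final String url : urls) {
                fetchers.submit(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            // a real download(url) would go here
                            pages.put("<html><!-- contents of " + url + " --></html>");
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }

            parsers.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        for (int i = 0; i < urls.size(); i++) {
                            String html = pages.take();
                            // real scraping/parsing logic would go here
                            System.out.println("parsed " + html.length() + " chars");
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });

            fetchers.shutdown();
            fetchers.awaitTermination(1, TimeUnit.MINUTES);
            parsers.shutdown();
        }
    }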

I'm a big fan of HTML Agility Pack for web scraping in the .NET world, but I cannot make a specific recommendation for Java. The following question may be of use in finding a good, Java-based scraping platform:

Web scraping with Java

Upvotes: 0
