peppered

Reputation: 698

HttpClient multithread performance

I have an application that downloads more than 4500 HTML pages from 62 target hosts using HttpClient (4.1.3 or 4.2-beta). It runs on Windows 7 64-bit. Processor: Core i7 2600K. Network bandwidth: 54 Mb/s.

At the moment it uses these parameters:
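The original snippet is missing from the post; below is a minimal sketch consistent with the limits visible in the log line further down (80 connections in total, 5 per route), assuming a PoolingClientConnectionManager:

PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager();
connectionManager.setMaxTotal(80);          // matches "total allocated: 10 of 80"
connectionManager.setDefaultMaxPerRoute(5); // matches "route allocated: 1 of 5"
HttpClient httpClient = new DefaultHttpClient(connectionManager);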

With these settings my network usage (in Windows Task Manager) does not rise above 2.5%, and downloading the 4500 pages takes 70 minutes. In the HttpClient logs I see entries like this:

DEBUG ForkJoinPool-2-worker-1 [org.apache.http.impl.conn.PoolingClientConnectionManager]: Connection released: [id: 209][route: {}->http://stackoverflow.com][total kept alive: 6; route allocated: 1 of 5; total allocated: 10 of 80]

The total number of allocated connections never rises above 10-12, even though I've configured the limit to be 80 connections. If I raise the parallelism level to 20 or 80, network usage stays the same, but a lot of connection timeouts are generated.

I've read the tutorials on hc.apache.org (the HttpClient Performance Optimization Guide and the HttpClient Threading Guide), but they didn't help.

The task's code looks like this:

import java.util.List;
import java.util.concurrent.RecursiveAction;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

public class ContentDownloader extends RecursiveAction {
    private static final Log logger = LogFactory.getLog(ContentDownloader.class);

    private final HttpClient httpClient;
    private final HttpContext context;
    private final List<Entry> entries;

    public ContentDownloader(HttpClient httpClient, List<Entry> entries) {
        this.httpClient = httpClient;
        this.context = new BasicHttpContext();
        this.entries = entries;
    }

    private void computeDirectly(Entry entry) {
        final HttpGet get = new HttpGet(entry.getLink());
        try {
            HttpResponse response = httpClient.execute(get, context);
            int statusCode = response.getStatusLine().getStatusCode();

            if (statusCode >= 400 && statusCode <= 600) {
                logger.error("Couldn't get content from " + get.getURI() + "\n" + response);
            } else {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Reading the entity fully also frees the pooled connection for reuse.
                    String htmlContent = EntityUtils.toString(entity).trim();
                    entry.setHtml(htmlContent);
                    EntityUtils.consumeQuietly(entity);
                }
            }
        } catch (Exception e) {
            // Don't swallow failures silently; at least log them.
            logger.error("Request for " + entry.getLink() + " failed", e);
        } finally {
            get.releaseConnection();
        }
    }

    @Override
    protected void compute() {
        if (entries.size() <= 1) {
            if (!entries.isEmpty()) {
                computeDirectly(entries.get(0));
            }
            return;
        }
        // Split the work in half and let the fork/join pool run both halves.
        int split = entries.size() / 2;
        invokeAll(new ContentDownloader(httpClient, entries.subList(0, split)),
                new ContentDownloader(httpClient, entries.subList(split, entries.size())));
    }
}

So the question is: what is the best practice for using HttpClient from multiple threads? Are there rules for setting up the ConnectionManager and the HttpClient itself? How can I make use of all 80 connections and raise network usage?

If necessary, I will provide more code.

Upvotes: 3

Views: 10043

Answers (3)

Thomas

Reputation: 12029

The remote sites could be limiting the number of parallel connections from one IP. In fact, this is good practice, since many crawlers are badly implemented and place a high burden on servers.

If you crawl a public site rather than your own, you should at least respect robots.txt and limit your requests to one per second per remote IP.
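For example, a simple per-host throttle could look like the sketch below (HostThrottle is a hypothetical helper, not part of the question's code; the one-second interval follows the rule of thumb above):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: serializes requests to the same host so that
// consecutive requests are at least minIntervalMillis apart.
public class HostThrottle {
    private final long minIntervalMillis;
    private final ConcurrentMap<String, Object> hostLocks = new ConcurrentHashMap<String, Object>();
    private final ConcurrentMap<String, Long> lastRequest = new ConcurrentHashMap<String, Long>();

    public HostThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    public void acquire(String host) throws InterruptedException {
        synchronized (lockFor(host)) {
            Long last = lastRequest.get(host);
            long now = System.currentTimeMillis();
            if (last != null && now - last < minIntervalMillis) {
                Thread.sleep(minIntervalMillis - (now - last));
            }
            lastRequest.put(host, System.currentTimeMillis());
        }
    }

    private Object lockFor(String host) {
        Object lock = hostLocks.get(host);
        if (lock == null) {
            Object created = new Object();
            lock = hostLocks.putIfAbsent(host, created);
            if (lock == null) {
                lock = created;
            }
        }
        return lock;
    }
}

Each worker would then call throttle.acquire(get.getURI().getHost()) before httpClient.execute(...).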

In addition, you have a maximum of five connections per route (that is, http://www.example.com/[whatever]), so you can expect at most 5 parallel connections to any one remote "site". (The path is ignored; a route is determined by scheme, host and port.)

Upvotes: 1

Mark

Reputation: 847

I'm not sure how many different hosts you are pulling from, but if it's a small number (or just one), you want to increase the max per route. This will increase your concurrency per host.

Currently you have it set to 5. Since you're observing a maximum of 10-12 connections in use, perhaps you're only hitting 2-3 different hosts, in which case the math adds up.
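Assuming the PoolingClientConnectionManager from the question, the per-route limit can be raised either for all routes or only for specific hosts; the value of 20 and the example host below are illustrative:

PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager();
connectionManager.setMaxTotal(80);
// Raise the default limit that applies to every route...
connectionManager.setDefaultMaxPerRoute(20);
// ...or raise it only for the hosts you hit hardest:
connectionManager.setMaxPerRoute(
        new HttpRoute(new HttpHost("stackoverflow.com", 80)), 20);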

Upvotes: 4

ok2c

Reputation: 27593

Apache HttpClient should definitely be fast enough to saturate the bandwidth even of a loopback interface. I suspect the performance issue has more to do with the efficiency of content processing than with content retrieval: your application is simply spending more time processing HTML content and extracting links than downloading new pages, which is what causes the bandwidth under-utilization. Even the fact that your code converts the HTML content to a String before processing it leads me to believe that your application spends more time copying data around in memory than transferring it across the wire.
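One way to act on that, sketched here against the computeDirectly method from the question (processContent is a hypothetical stand-in for whatever processing the application actually does), is to consume the entity as a stream instead of materializing every page as a String first:

HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    try {
        // Hypothetical method: parse links and whatever else is needed
        // directly from the stream, storing results on the entry,
        // instead of copying the whole page into a String first.
        processContent(entry, instream);
    } finally {
        instream.close(); // closing the stream returns the connection to the pool
    }
}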

Upvotes: 0
