Abhinav

Reputation: 473

Best way to parallelize thousands of downloads

I am creating an application in which I have to download thousands of images (~1 MB each) using Java.

My REST request takes a list of album URLs, and each album contains multiple images.

So my request looks something like:

[
  "www.abc.xyz/album1",
  "www.abc.xyz/album2",
  "www.abc.xyz/album3",
  "www.abc.xyz/album4",
  "www.abc.xyz/album5"
]

Suppose each of these albums has 1000 images, so I need to download 5000 images in parallel.

Right now I have implemented it using parallelStream() but I feel that I can optimize it further.

There are two principal classes - AlbumDownloader and ImageDownloader (Spring components).

So the main application creates a parallelStream() on the list of albums:

albumData.parallelStream().forEach(ad -> albumDownloader.downloadAlbum(ad));

And a parallelStream() inside AlbumDownloader -> downloadAlbum() method as well:

List<Boolean> downloadStatus = albumData.getImageDownloadData()
        .parallelStream()
        .map(idd -> imageDownloader.downloadImage(idd))
        .collect(Collectors.toList());

I am thinking about using CompletableFuture with ExecutorService, but I am not sure what pool size I should use.

Should I create a separate pool for each Album?

ExecutorService executor = Executors.newFixedThreadPool(Math.min(albumData.getImageDownloadData().size(), 1000));

That would create 5 different pools of 1000 threads each, around 5000 threads in total, which might degrade performance instead of improving it.

Could you please give me some ideas to make it very very fast?

I am using Apache Commons IO FileUtils to download the files, by the way, and I have a machine with 12 available CPU cores.

Upvotes: 0

Views: 1086

Answers (2)

Stephen C

Reputation: 719376

The only way to make it "very very fast" is to get a "very very fast" network connection to the server; e.g. co-locate your client with the server that you are downloading from.

Your download speeds are going to be constrained by a number of potential bottlenecks. These include:

  1. The performance of the server; i.e. how fast it can assemble the data to send to you and push it through its network interface.

  2. Per-user request limits imposed by the service.

  3. The end-to-end performance of the network path between your client and the server.

  4. The performance of the machine you are running on in terms of moving data from the network and putting it (I guess) onto your local disk.

The bottleneck could be any of these, or a combination of them.

Throwing thousands of threads at the problem is unlikely to improve things. Indeed, if anything it is likely to make performance less than optimal. For example:

  • it could congest your network link, or
  • it could trigger anti-hogging or anti-DOS defenses in the server you are fetching from.

A better (simple) idea would be to use an ExecutorService with a small bounded worker pool, and submit the downloads to the pool as tasks.
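A minimal sketch of that idea, assuming a `downloadImage` method as in the question (the method body here is just a placeholder for the real download logic, and the pool size of 16 is an arbitrary starting point to tune):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedDownloadPool {

    // Placeholder for the real download; assume it returns true on success.
    static boolean downloadImage(String url) {
        return url != null && !url.isEmpty();
    }

    public static void main(String[] args) throws Exception {
        List<String> imageUrls = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            imageUrls.add("www.abc.xyz/album1/image" + i);
        }

        // Small bounded pool; the right size has to be found by measurement.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
            // Submit each download as an independent task.
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String url : imageUrls) {
                futures.add(pool.submit(() -> downloadImage(url)));
            }
            // Wait for all tasks and count the successes.
            int succeeded = 0;
            for (Future<Boolean> f : futures) {
                if (f.get()) {
                    succeeded++;
                }
            }
            System.out.println("Downloaded " + succeeded + " of " + imageUrls.size());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool caps concurrency at a fixed level no matter how many images are queued, which is the point: the queue absorbs the 5000 tasks while only a handful are in flight at once.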

Other things:

  1. Try to keep HTTP / HTTPS connections open between downloads from the same server. Some client libraries will do this kind of thing for you.
  2. If you have to download from a number of different servers, try to balance the load across the servers. Consider implementing per-server queues and trying to balance work so that individual servers don't see "bursts" of activity.
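On point 1: Java 11's built-in `java.net.http.HttpClient` pools connections internally, so repeated requests to the same host can reuse an already-open connection, whereas `FileUtils.copyURLToFile` opens a fresh `URLConnection` per file. A sketch of sharing one client (the URL is made up, and `downloadTo` is a hypothetical helper; the `main` method only builds a request rather than hitting the network):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SharedClientDownloader {

    // One client for the whole application: it keeps connections alive
    // between requests, avoiding a TCP/TLS handshake per image.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Hypothetical helper: fetch one URL straight to a file.
    static void downloadTo(String url, Path target) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        CLIENT.send(request, HttpResponse.BodyHandlers.ofFile(target));
    }

    public static void main(String[] args) {
        // No real network call here; just demonstrate building a request.
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("https://www.abc.xyz/album1/image1.jpg"))
                .GET()
                .build();
        System.out.println(request.uri());
    }
}
```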

I would also advise you to make sure that you have permission to do what you are doing. Companies in the music publishing business have good lawyers. They could make your life unpleasant [1] if they perceive you to be violating their terms and conditions or stealing their intellectual property.

[1] - Like blocking your IP address or issuing take-down requests to your service provider.

Upvotes: 1

Gray

Reputation: 116898

Suppose each of these albums has 1000 images, so I need to download 5000 images in parallel.

It's wrong to think of your application doing 5000 things in parallel. What you are trying to do is to optimize your throughput – you are trying to download all of the images in the shortest amount of time.

You should try a single fixed-size thread pool and then experiment with the number of threads until you maximize throughput – double the number of processors is a reasonable starting point. If your application is mostly waiting on the network or the server, you may be able to raise the thread count further, but you don't want to overload the server so that it slows to a crawl, and you don't want to thrash your own application with a huge number of threads.

That would create 5 different pools of 1000 threads each, around 5000 threads in total, which might degrade performance instead of improving it.

I see no point in multiple pools unless there are different servers for each album or some other reason why the downloads from each album are different.
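The two points above (one shared pool, tuned size) could be combined roughly like this; `downloadImage` is a stand-in for the real download, and the images from every album are flattened into a single task list:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class SharedPoolDownloader {

    // Stand-in for the real download logic.
    static boolean downloadImage(String url) {
        return true;
    }

    public static void main(String[] args) {
        // One pool for everything; 2x the core count is only a starting
        // point, and the final number should come from measuring throughput.
        int threads = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // All images, from every album, go through the same pool.
        List<String> urls = List.of(
                "www.abc.xyz/album1/img1",
                "www.abc.xyz/album1/img2",
                "www.abc.xyz/album2/img1");

        List<CompletableFuture<Boolean>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> downloadImage(u), pool))
                .collect(Collectors.toList());

        long ok = futures.stream().filter(CompletableFuture::join).count();
        System.out.println(ok + "/" + urls.size() + " succeeded");
        pool.shutdown();
    }
}
```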

Upvotes: 1
