Rafael Steil

Reputation: 4697

Parallel processing of several files in a cluster

At the company I work for, we have to process a few thousand files every day, which takes several hours. The operations are mostly CPU intensive, like converting PDFs to high-resolution images and later creating many different sizes of those images.

Each one of those tasks takes a lot of CPU, so we can't simply start many instances on the same machine because there wouldn't be enough processing power left for everything. That is why it takes several hours to finish the whole batch.

The most obvious thing to do, as I see it, is to partition the set of files and have them processed by several machines concurrently (5, 10, 15 machines; I don't know yet how many would be necessary).

I don't want to reinvent the wheel and create a task manager myself (nor do I want the hassle), but I am not sure which tool I should use.

Although we don't have big data, I have looked at Hadoop for a start (we are running on Amazon), and its capabilities for handling the nodes seem interesting. However, I don't know whether it makes sense to use it. I am looking at Hazelcast as well, but I have no experience at all with it or its concepts yet.

What would be a good approach for this task?

Upvotes: 1

Views: 659

Answers (2)

Praveen Sripati

Reputation: 33495

Hadoop is being used for a wide variety of data processing problems, some of them related to image processing as well. The problem mentioned in the OP can also be easily solved using Hadoop. Note that in cases where the data to be processed is small, Hadoop adds overhead.

If you are new to Hadoop, I would suggest a couple of things:

  • Buy the Hadoop: The Definitive Guide book.
  • Go through the MapReduce resources.
  • Start going through the tutorials (1 and 2) and set up Hadoop on a single node and on a cluster. There is no need for Amazon if 1-2 machines can be spared for learning.
  • Run the sample programs and understand how they work.
  • Start migrating the problem area to Hadoop (a minimal sketch of such a job follows this list).
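
As a rough illustration of that last step, here is a minimal map-only MapReduce sketch, assuming the input is a text file on HDFS listing one PDF path per line. The class name PdfConvertJob and the convertPdf helper are illustrative placeholders, not part of any existing codebase:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PdfConvertJob {

        public static class ConvertMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text pdfPath, Context ctx)
                    throws IOException, InterruptedException {
                // The CPU-heavy work happens here; emit where the output landed.
                String outputLocation = convertPdf(pdfPath.toString());
                ctx.write(pdfPath, new Text(outputLocation));
            }

            // Placeholder for the actual PDF-to-image conversion and resizing.
            private String convertPdf(String path) {
                return path + ".images";
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "pdf-convert");
            job.setJarByClass(PdfConvertJob.class);
            job.setMapperClass(ConvertMapper.class);
            job.setNumReduceTasks(0);                  // map-only: no aggregation needed
            job.setInputFormatClass(NLineInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Hand each mapper a small batch of paths so tasks stay CPU-bound.
            NLineInputFormat.setNumLinesPerSplit(job, 10);
            NLineInputFormat.addInputPath(job, new Path(args[0])); // file listing PDF paths
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Since no aggregation across files is needed, the reducer is dropped entirely; the framework takes care of distributing the splits across the nodes and retrying failed tasks.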

The advantage of Hadoop over other software is the ecosystem around it. As of now the ecosystem around Hadoop is huge and growing; I am not sure about Hazelcast's.

Upvotes: 1

enesness

Reputation: 3133

You can use a Hazelcast distributed queue.

First, put your files (file references) as tasks into a distributed queue. Then each node takes a task from the queue, processes it, and puts the result into another distributed queue/list or writes it to a DB/storage.
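
A minimal sketch of that pattern, assuming the Hazelcast JARs are on the classpath and cluster discovery is configured (e.g. via hazelcast.xml); the queue names and the processFile helper are illustrative assumptions:

    import java.util.concurrent.BlockingQueue;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class FileWorker {

        public static void main(String[] args) throws InterruptedException {
            // Each machine runs one of these; the instances discover
            // each other and form a cluster.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // Distributed queues shared by all cluster members.
            BlockingQueue<String> tasks = hz.getQueue("fileTasks");
            BlockingQueue<String> results = hz.getQueue("fileResults");

            // A producer somewhere enqueues the file references, e.g.:
            // tasks.put("s3://bucket/input/doc-0001.pdf");

            while (true) {
                String fileRef = tasks.take();            // blocks until a task arrives
                String resultRef = processFile(fileRef);  // CPU-heavy conversion
                results.put(resultRef);                   // or write to DB/storage instead
            }
        }

        // Placeholder for the PDF-to-image conversion step.
        private static String processFile(String fileRef) {
            return fileRef + ".done";
        }
    }

Since take() hands each item to exactly one consumer, adding machines simply adds throughput with no extra coordination, though in this simple form a task taken by a node that crashes mid-processing would be lost.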

Upvotes: 0
