Reputation: 4697
At the company I work for, every day we have to process a few thousand files, which takes some hours. The operations are mostly CPU intensive, like converting PDFs to high-resolution images and later creating many different sizes of those images.
Each of those tasks takes a lot of CPU, so we can't simply start many instances on the same machine or there won't be enough processing power left for everything. As a result, it takes several hours to finish everything.
The most obvious thing to do, as I see it, is to partition the set of files and have them processed by several machines concurrently (5, 10, 15 machines; I don't know yet how many would be necessary).
I don't want to reinvent the wheel and create a task manager myself (nor do I want the hassle), but I am not sure which tool I should use.
Although we don't have big data, I have looked at Hadoop as a starting point (we are running on Amazon), and its ability to manage nodes seems interesting. However, I don't know if it makes sense to use it here. I am also looking at Hazelcast, but I have no experience with it or its concepts yet.
What would be a good approach for this task?
Upvotes: 1
Views: 659
Reputation: 33495
Hadoop is used for a wide variety of data processing problems, some of them related to image processing as well. The problem mentioned in the OP can also be solved easily with Hadoop. Note that in cases where the data to be processed is small, Hadoop adds overhead.
If you are new to Hadoop, I would suggest a couple of things.
The advantage of Hadoop over other software is its ecosystem. The ecosystem around Hadoop is huge and still growing; I am not sure the same is true of Hazelcast. See the sketch below for what the job itself could look like.
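For this workload the Hadoop job would essentially be a map-only job: feed it a text file listing the PDF paths and let each map task run the conversion on one or a few files. A minimal sketch, assuming the conversion is done by your existing code (the `convertPdf` helper below is a placeholder, not a real API):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfConversionJob {

    // Map-only job: each input line is the path of one PDF to convert.
    public static class ConvertMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String pdfPath = value.toString().trim();
            // Placeholder for your existing CPU-heavy conversion
            // (PDF -> high-resolution images -> resized variants).
            String resultLocation = convertPdf(pdfPath);
            context.write(new Text(pdfPath + "\t" + resultLocation), NullWritable.get());
        }

        private String convertPdf(String pdfPath) {
            // Call out to your real conversion tool here.
            return pdfPath + ".converted";
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf-conversion");
        job.setJarByClass(PdfConversionJob.class);
        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle/reduce needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // One line per PDF path; NLineInputFormat controls how many files each map task gets.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // text file listing PDF paths
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // where results are recorded

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the work is CPU-bound rather than data-bound, `NLineInputFormat` is used here only to spread the file list evenly across the cluster; the actual PDFs can stay on S3 or shared storage.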
Upvotes: 1
Reputation: 3133
You can use a Hazelcast distributed queue.
First, put your files (file references) as tasks into a distributed queue. Each node then takes a task from the queue, processes it, and puts the result into another distributed queue/list or writes it to a DB/storage. A rough sketch of a worker follows.
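As an illustration (assuming Hazelcast 4/5; the queue names and the `convertPdf` placeholder are made up), each processing machine could run a worker like this, while any member of the cluster pushes the file references into the `pdf-tasks` queue:

```java
import com.hazelcast.collection.IQueue;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class FileWorker {

    public static void main(String[] args) throws InterruptedException {
        // Each processing machine runs this; the instances discover each other
        // and form one cluster (discovery/network settings omitted here).
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        IQueue<String> tasks = hz.getQueue("pdf-tasks");      // file references to process
        IQueue<String> results = hz.getQueue("pdf-results");  // finished work

        while (true) {
            String pdfPath = tasks.take();        // blocks until a task is available
            String output = convertPdf(pdfPath);  // your existing CPU-heavy conversion
            results.put(output);
        }
    }

    private static String convertPdf(String pdfPath) {
        // Placeholder for the real conversion (PDF -> images -> resized variants).
        return pdfPath + ".done";
    }
}
```

Because `take()` is a blocking, cluster-wide operation, adding more machines simply means starting more workers; each file reference is handed to exactly one node.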
Upvotes: 0