Reputation: 63
I have a 50TB set of ~1GB TIFF images that I need to run the same algorithm on. Currently, I have the rectification process written in C++, and it works well; however, it would take forever to run on all of these images sequentially. I understand that an implementation of MapReduce/Spark could work, but I can't seem to figure out how to handle image input/output.
Every tutorial/example that I've seen uses plain text. In theory, I would like to utilize Amazon Web Services too. If anyone has some direction for me, that would be great. I'm obviously not looking for a full solution, but maybe someone has successfully implemented something close to this? Thanks in advance.
Upvotes: 4
Views: 3935
Reputation: 14379
One aspect of solving a problem in the MapReduce paradigm that most developers are not aware of is:
If you run heavy computation on your data nodes, the system will limp.
A big reason you mostly see simple text-based examples is that those are the kinds of problems that run well on commodity-grade hardware. In case you don't know or have forgotten, I'd like to point out that:
The MapReduce programming paradigm is meant for jobs that need to scale out rather than scale up.
Some hints:
Upvotes: 2
Reputation: 294227
Is your data already in HDFS? What exactly do you expect to leverage from Hadoop/Spark? It seems to me that all you need is a queue of filenames and a bunch of machines to run your executable on.
One option is to pack your app into AWS Lambda (see Running Arbitrary Executables in AWS Lambda) and trigger an invocation for each file. Another is to pack your app into a Docker container, start up a bunch of them in ECS, and let them loose on a queue of filenames (or URLs or S3 buckets) to process; a sketch of such a worker loop follows.
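For the ECS/queue route, here is a minimal worker-loop sketch using the AWS SDK for C++ with SQS. Everything specific in it is an assumption rather than something from the question: the queue URL is a placeholder, each message is assumed to carry one S3 key, and rectify() stands in for your existing rectification code.

```cpp
// Minimal worker-loop sketch (assumptions: an SQS queue holding one S3 object
// key per message; rectify() is a stand-in for the existing C++ routine).
#include <aws/core/Aws.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/ReceiveMessageRequest.h>
#include <aws/sqs/model/DeleteMessageRequest.h>
#include <iostream>
#include <string>

// Placeholder for the existing rectification routine.
void rectify(const std::string& s3Key) {
    std::cout << "processing " << s3Key << std::endl;
    // download from S3, run the algorithm, upload the result ...
}

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        // Hypothetical queue URL, replace with your own.
        const Aws::String queueUrl =
            "https://sqs.us-east-1.amazonaws.com/123456789012/tiff-jobs";
        Aws::SQS::SQSClient sqs;

        while (true) {
            Aws::SQS::Model::ReceiveMessageRequest req;
            req.SetQueueUrl(queueUrl);
            req.SetMaxNumberOfMessages(1);
            req.SetWaitTimeSeconds(20);  // long polling

            auto outcome = sqs.ReceiveMessage(req);
            if (!outcome.IsSuccess() || outcome.GetResult().GetMessages().empty())
                continue;  // nothing to do right now

            const auto& msg = outcome.GetResult().GetMessages()[0];
            rectify(msg.GetBody().c_str());  // message body = S3 key of one TIFF

            // Delete the message only after successful processing.
            Aws::SQS::Model::DeleteMessageRequest del;
            del.SetQueueUrl(queueUrl);
            del.SetReceiptHandle(msg.GetReceiptHandle());
            sqs.DeleteMessage(del);
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```

Each container runs this loop independently, so scaling is just a matter of starting more ECS tasks against the same queue.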
I think Hadoop/Spark is overkill, especially since they're quite bad at handling 1GB splits as input, and your processing is not a map/reduce job (there are no key-values for reducers to merge). If you must, you could adapt your C++ app to read from stdin and use Hadoop Streaming; a minimal mapper sketch follows.
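If you do go the Hadoop Streaming route, the mapper can stay tiny. A sketch under the assumption that the job's input is a plain-text list of image paths (one per line) and that rectify() again stands in for your existing C++ code:

```cpp
// Minimal Hadoop Streaming mapper sketch: reads one image path per line from
// stdin, runs the rectification on it, and emits "path<TAB>status" on stdout.
// rectify() is a placeholder for the existing C++ routine.
#include <iostream>
#include <string>

bool rectify(const std::string& imagePath) {
    // open the TIFF, run the algorithm, write the output ...
    return true;
}

int main() {
    std::string path;
    while (std::getline(std::cin, path)) {
        if (path.empty()) continue;
        const bool ok = rectify(path);
        // Emit a key/value pair to keep Hadoop happy even with no reducer.
        std::cout << path << "\t" << (ok ? "ok" : "failed") << "\n";
    }
    return 0;
}
```

You would wire it up with the usual streaming options (-input pointing at the path list, -mapper pointing at this binary, and zero reducers, e.g. -numReduceTasks 0), so Hadoop only ships around the short list of paths rather than the 1GB images themselves.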
Ultimately, the question is: where is the 50TB of data stored, and in what format? The solution depends a lot on the answer, because you want to bring the compute to where the data is and avoid transferring 50TB into AWS, or even uploading it into HDFS.
Upvotes: 3