Reputation: 69920
I have a huge DynamoDB table that I want to analyze in order to aggregate data stored in its attributes. The aggregated data should then be processed by a Java application. While I understand the basic concepts behind MapReduce, I've never used it before.
In my case, let's say that I have customerId and orderNumbers attributes in every DynamoDB item, and that I can have more than one item for the same customer. Like:
customerId: 1, orderNumbers: 2
customerId: 1, orderNumbers: 6
customerId: 2, orderNumbers: -1
Basically I want to sum the orderNumbers for each customerId, and then execute some operations in Java with the aggregate.
AWS Elastic MapReduce could probably help me, but I don't understand how to connect a custom JAR with DynamoDB. My custom JAR probably needs to expose map and reduce functions; where can I find the right interfaces to implement?
Plus, I'm a bit confused by the docs; it seems like I should first export my data to S3 before running my custom JAR. Is this correct?
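If exporting to S3 first is indeed the way to go, this is roughly the Mapper/Reducer pair I imagine writing. I'm assuming the export produces tab-separated text lines; that, plus all the class names here, is just my guess:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderSum {

    // Emits (customerId, orderNumbers) for every exported line.
    public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumption: each exported line looks like "customerId<TAB>orderNumbers".
            String[] fields = line.toString().split("\t");
            context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        }
    }

    // Sums all orderNumbers values that arrive for one customerId.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text customerId, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(customerId, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "order-sum");
        job.setJarByClass(OrderSum.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class); // safe here because summing is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the S3 export location
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // where the sums end up
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}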
Thanks
Upvotes: 7
Views: 4669
Reputation: 2069
Also see: http://aws.amazon.com/code/Elastic-MapReduce/28549, which uses Hive to access DynamoDB. This seems to be the official AWS way of accessing DynamoDB from Hadoop.
If you need to write custom code in a custom JAR, I found: DynamoDB InputFormat for Hadoop
However, I could not find documentation on the Java parameters to set for this InputFormat that correspond to the Hive parameters. According to this article, it was not released by Amazon: http://www.newvem.com/amazon-dynamodb-part-iii-mapreducin-logs/
Also see: jar containing org.apache.hadoop.hive.dynamodb
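If you want to experiment with that JAR anyway, plugging an InputFormat into a job would look roughly like the sketch below. Note that the configuration key and the InputFormat class name are placeholders I extrapolated from the Hive side; I have no documentation confirming either.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;

public class DirectDynamoDBJob {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder key, guessed from the Hive parameter of the same name.
        conf.set("dynamodb.table.name", "Orders");

        Job job = new Job(conf, "read-dynamodb-directly");
        // Placeholder class name; use whatever InputFormat the JAR actually ships.
        job.setInputFormatClass((Class<? extends InputFormat>) Class
                .forName("org.apache.hadoop.hive.dynamodb.DynamoDBInputFormat"));
        // ... set the mapper, reducer, and output types as usual, then submit.
    }
}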
Therefore, the official, documented way to use DynamoDB data from a custom MapReduce job is to export the data from DynamoDB to S3, then let Elastic MapReduce take it from S3. My guess is that this is because DynamoDB was designed to be accessed randomly as a key/value "NoSQL" store, while Hadoop input and output formats are built for sequential access with large block sizes. The undocumented Amazon code may contain tricks to bridge this gap.
Since the export/re-import uses up resources, it would be best if the task could be accomplished entirely from within Hive.
Upvotes: 0
Reputation: 10052
Note: I haven't built a working EMR job flow myself; I've just read about it.
First of all, read the Prerequisites for Integrating Amazon EMR with Amazon DynamoDB.
You can work directly on DynamoDB: see Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB. As you can see, you can run "SQL-like" queries that way.
If you have zero knowledge about Hadoop, you should probably read some introductory material, such as What is Hadoop.
This tutorial is another good read: Using Amazon Elastic MapReduce with DynamoDB.
Regarding your custom JAR application, you need to upload it to S3. Use this guide: How to Create a Job Flow Using a Custom JAR
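If you would rather start the job flow programmatically than through the console, the AWS SDK for Java can do it. Here is a minimal sketch; the credentials, bucket names, JAR path, and instance settings are all placeholders you would replace with your own values.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class StartJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders

        // Your custom JAR, already uploaded to S3, plus the arguments for its main method.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3n://my-bucket/order-sum.jar") // placeholder location
                .withArgs("s3n://my-bucket/export/", "s3n://my-bucket/output/");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("order-sum-job-flow")
                .withSteps(new StepConfig("order-sum-step", jarStep))
                .withLogUri("s3n://my-bucket/logs/") // placeholder location
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small")
                        .withHadoopVersion("1.0.3"));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}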
I hope this will help you get started.
Upvotes: 3