I have two big XML files uploaded to an Amazon S3 bucket named "ccssdd", under a folder named data: data/friendships.xml and data/users.xml.
The structure of users.xml is:
<user>
<id>1</id>
<age>24</age>
<x>4</x>
<y>7</y>
<interest>football</interest>
</user>
<user>
..
and friendships.xml:
<friendship>
<user1>1</user1>
<user2>3</user2>
</friendship>
<friendship>
..
I need to write a job jar to run on Amazon Elastic MapReduce that computes the number of friends for each user.
I know I should emit a (userid, 1) pair for each user in every friendship element as the output of the map function, and then, in the reduce function, sum the 1s for each userid.
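What I have in mind is roughly the following sketch. It is plain Java with no Hadoop dependencies, just to show the map/reduce logic on my data; the class and method names are my own and would have to be adapted to the real Mapper/Reducer classes:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FriendCount {

    // "map" step: for one <friendship> record, emit (user1, 1) and (user2, 1)
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        Matcher m = Pattern.compile("<user([12])>(\\d+)</user\\1>").matcher(record);
        while (m.find()) {
            pairs.add(new AbstractMap.SimpleEntry<>(m.group(2), 1));
        }
        return pairs;
    }

    // "reduce" step: sum the 1s for each user id
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] records = {
            "<friendship><user1>1</user1><user2>3</user2></friendship>",
            "<friendship><user1>1</user1><user2>2</user2></friendship>"
        };
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String r : records) {
            pairs.addAll(map(r));
        }
        System.out.println(reduce(pairs)); // prints {1=2, 2=1, 3=1}
    }
}
```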
1- I know I can build my app in Eclipse to produce the job .jar file, but I don't know which libraries I should download and add to the project.
2- I don't know how to connect my application to S3, read the XML elements one by one, and extract the user ids from them.
Please kindly help me with that. I've found this tutorial, which is very similar to my problem, but when I copy it into Eclipse I get errors on almost every line: none of the org.* libraries are resolved. I also have no idea how to access the data files on S3.
Here is one approach.
Use a distribution from Cloudera, MapR, or elsewhere, and use the versions (jars) of Hadoop that ship with it. Test your jobs thoroughly on your local machine until you are confident everything works, because Amazon charges an hourly rate per machine even if your job fails after 30 seconds.
Once you are confident, create an "uber jar" containing all your code and all the classes in the Hadoop jars you used.
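One common way to build such an "uber jar" is Maven's shade plugin, which repackages your code together with all dependency classes into a single jar. A minimal pom.xml fragment might look like this (the plugin version shown is illustrative; pick whatever is current):

```xml
<!-- pom.xml build section: bundle all dependencies into one runnable jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With that in place, `mvn package` produces the combined jar in the target/ directory.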
Upload the jar and data to S3 as described in this excellent tutorial. EMR works seamlessly with S3.
Run the job as described in the tutorial. If something goes wrong, wait a while after the job finishes before checking the logs, because they appear with a lag.
Hope that helps.