Reputation: 3653
I have two datasets, Customer and Goods. The Customer dataset has the customer id as key and, as value, the list of ids of the goods that customer bought. The Goods dataset has the goods id as key and its price as value. How do I join these two datasets on the foreign key, the goods id?
Customer dataset:
customer id, goods id1, goods id2, ...
Goods dataset:
goods id1, price1
goods id2, price2
The join result dataset I want:
customer id1, price1, price2, ...
customer id2, price3, price4, ...
I am new to Hadoop. I know it can be done in Pig and Hive, but I want to implement it in Java with Hadoop. Can anybody help me? Many thanks!
Upvotes: 2
Views: 2408
Reputation: 2225
Maybe I can add to Paul's answer. You can use the distributed cache here. Load the smaller of your files, which I guess is the goods dataset in your case, into the distributed cache (it can hold up to 10 GB of data by default). Then use a plain map-only job that reads the customer dataset and performs the join against the matching records from the distributed cache.
The useful property is that the data in the distributed cache is accessible to every mapper, regardless of which datanode the task runs on.
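As an illustration, here is a minimal sketch of such a mapper using Hadoop's newer mapreduce API. The class name CustomerGoodsJoinMapper and the comma-separated file layout are my assumptions from the question, not code from any of the references below; it also relies on the default behaviour of symlinking cached files into the task's working directory:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: each customer record is joined against the in-memory goods table.
public class CustomerGoodsJoinMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> goodsPrices = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files added with job.addCacheFile() are localized next to the task;
        // read each "goodsId,price" line into the lookup table.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null) {
            for (URI uri : cacheFiles) {
                String name = new File(uri.getPath()).getName(); // localized symlink name
                try (BufferedReader reader = new BufferedReader(new FileReader(name))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split(",");
                        goodsPrices.put(parts[0].trim(), parts[1].trim());
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: "customerId,goodsId1,goodsId2,..."
        String[] fields = value.toString().split(",");
        StringBuilder out = new StringBuilder(fields[0].trim());
        for (int i = 1; i < fields.length; i++) {
            String price = goodsPrices.get(fields[i].trim());
            out.append(',').append(price == null ? "" : price);
        }
        context.write(new Text(out.toString()), NullWritable.get());
    }
}
```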
http://bigdatapartnership.com/map-side-and-reduce-side-joins/ can give you some insight into joins in MapReduce applications.
Hadoop: The Definitive Guide by Tom White gives program examples of map-side joins, reduce-side joins, and joins with the distributed cache.
Chapter 5 of Hadoop in Action by Chuck Lam also discusses joins.
Upvotes: 1
Reputation: 33495
Check the Relational Joins section in the Data-Intensive Text Processing with MapReduce document.
Upvotes: 1
Reputation: 2046
How big is the "Goods" dataset? If it is small enough, the easiest thing to do is to load it into memory in your mappers (in a HashMap) and make the "Customers" dataset the input to your job. Then, as you iterate over the input, you can look up each goods id in the map. You can use the distributed cache to get your "Goods" data distributed to each node in the cluster.
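For completeness, a driver along these lines could wire this up; this is again only a sketch, and the argument order, paths, and the CustomerGoodsJoinMapper class from the earlier snippet are all assumptions:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerGoodsJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "customer-goods join");
        job.setJarByClass(CustomerGoodsJoinDriver.class);

        // The large customer dataset is the normal map input.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        // Ship the small goods file to every node via the distributed cache.
        job.addCacheFile(new URI(args[1]));

        job.setMapperClass(CustomerGoodsJoinMapper.class);
        job.setNumReduceTasks(0); // map-only join, no reducer needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the join happens entirely in the mappers, setting the number of reduce tasks to zero avoids an unnecessary shuffle.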
Upvotes: 0