Treper

Reputation: 3653

How to join two datasets on their common key in Hadoop?

I have two datasets, Customer and Goods. The Customer dataset has the customer id as key and a list of the goods ids the customer bought as value. The Goods dataset has the goods id as key and its price as value. How do I join these two datasets on the foreign key, the goods id?

Customer dataset:
customer id, goods id1, goods id2, ...

Goods dataset:
goods id1, price1
goods id2, price2

The join result dataset I want:
customer id1, price1, price2, ...
customer id2, price3, price4, ...

I am new to Hadoop. I know this can be done in Pig and Hive, but I want to implement it in Java with Hadoop. Can anybody help me? Many thanks!

Upvotes: 2

Views: 2408

Answers (3)

Arun A K

Reputation: 2225

Maybe I can add to Paul's answer. You can use the distributed cache here. Load the smaller of your files, which I guess is the Goods dataset in your case, into the distributed cache (the distributed cache can hold up to 10 GB of data by default). Then you can use a normal map over the Customer dataset and perform the join using the matching data from the distributed cache.

The interesting fact is that the data in the distributed cache can be accessed by every mapper, irrespective of which data node it runs on.
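To make this concrete, here is a minimal sketch of such a mapper using the classic Hadoop 1.x DistributedCache API. It assumes both files are comma-separated as shown in the question and that the goods file is the only file in the cache; the class name and the exact line parsing are my own illustrative assumptions, not something from the original answer.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical class name; performs the map-side join described above.
    public class CustomerGoodsJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> goodsPrices = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Read the goods file that the driver placed in the distributed cache
            // into an in-memory map of goods id -> price.
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cacheFiles != null && cacheFiles.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Assumed line format: "goods id, price"
                        String[] parts = line.split(",");
                        if (parts.length == 2) {
                            goodsPrices.put(parts[0].trim(), parts[1].trim());
                        }
                    }
                } finally {
                    reader.close();
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed line format: "customer id, goods id1, goods id2, ..."
            String[] fields = value.toString().split(",");
            StringBuilder prices = new StringBuilder();
            for (int i = 1; i < fields.length; i++) {
                String price = goodsPrices.get(fields[i].trim());
                if (price != null) {
                    if (prices.length() > 0) {
                        prices.append(",");
                    }
                    prices.append(price);
                }
            }
            // Emit "customer id" -> "price1,price2,..."
            context.write(new Text(fields[0].trim()), new Text(prices.toString()));
        }
    }

Since the join is completed in the mapper, the job can run map-only with zero reducers.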

http://bigdatapartnership.com/map-side-and-reduce-side-joins/ can give you insight into joins in MapReduce applications.

Hadoop: The Definitive Guide by Tom White gives example programs for the map-side join, the reduce-side join, and the join with the distributed cache.

Chapter 5 of Hadoop in Action by Chuck Lam also discusses joins.

Upvotes: 1

Praveen Sripati

Reputation: 33495

Check the Relational Joins section in the Data-Intensive Text Processing with MapReduce document.

Upvotes: 1

Paul M

Reputation: 2046

How big is the "Goods" dataset? If it is small enough, the easiest thing to do is load it into memory in your mappers (in a HashMap) and make the "Customers" dataset the input to your job. Then you can look up the "Goods" prices as you iterate over your input. You can use the distributed cache to get your "Goods" data distributed to each node in the cluster.
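For completeness, here is a hedged sketch of the driver side of such a job, again with the classic Hadoop 1.x DistributedCache API. The HDFS paths and class names are placeholders of my own, and the mapper is assumed to be one like the sketch in the answer above.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver class for the map-side join job.
    public class CustomerGoodsJoinDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "customer-goods join");
            job.setJarByClass(CustomerGoodsJoinDriver.class);

            // Ship the small goods file to every node via the distributed cache.
            // "/data/goods.txt" is a placeholder path.
            DistributedCache.addCacheFile(new URI("/data/goods.txt"), job.getConfiguration());

            // The large customer dataset is the normal map input;
            // both paths here are placeholders.
            FileInputFormat.addInputPath(job, new Path("/data/customers"));
            FileOutputFormat.setOutputPath(job, new Path("/data/joined"));

            job.setMapperClass(CustomerGoodsJoinMapper.class);
            job.setNumReduceTasks(0); // map-only job: the join happens in the mapper
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that this pattern only works while the "Goods" dataset fits comfortably in each mapper's memory; otherwise a reduce-side join is the usual fallback.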

Upvotes: 0
