Ravindra babu

Reputation: 38910

Hadoop: joining two different data sets in Java at the Mapper or Reducer end

I have two different data sets.

***Comments.csv:*** 

id
userid

***Posts.csv:***

id
post_type
creationdate
score
viewcount
owneruserid
title
answercount
commentcount

I have to display the name and number of posts created by the user who has the maximum reputation.

I know how MapReduce works with a single file, and I know how to set multiple input files for a Job. But I don't know how to join different data sets at the Mapper level.

I am not sure whether I can join these two data sets with one Mapper.

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      String[] data = value.toString().split(",");
      // Logic to write values to context 

    }

MultipleInputs.addInputPath(job,new Path(args[0]),TextInputFormat.class,CommentsMapper.class);
MultipleInputs.addInputPath(job,new Path(args[1]),TextInputFormat.class,PostsMapper.class);

My queries:

1. Map-side join or reduce-side join: which one is better?

2. Is it possible to use a single Mapper or Reducer? If so, how?

Please suggest a simple way to achieve this. I have gone through Stack Overflow questions about passing multiple data files to a Job, but in those the input format is the same for all files. In my case, the input formats are different.

Thanks in advance.

Upvotes: 0

Views: 706

Answers (1)

vlahmot

Reputation: 116

To perform a reduce-side join, you can have your map implementations emit

(K,V) -> (JOIN_KEY,DATA).

Then on the reduce side you will have access to all of the values associated with that key. If you want to ensure, for example, that your Post data comes first in the list and all of the comment data after it, you can implement a secondary sort.

Secondary Sort
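A minimal sketch of that tagging idea in plain Java (simulating the shuffle with a `TreeMap` rather than the Hadoop API; the column positions assumed here — `owneruserid` as the 6th field of Posts, `userid` as the 2nd field of Comments — are guesses based on the schemas in the question):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Map phase: tag each record with its source ("P:" for Posts,
    // "C:" for Comments) and key it by user id, mirroring what the
    // two mappers registered via MultipleInputs would emit.
    static String[] mapPost(String csvLine) {
        String[] f = csvLine.split(",");
        return new String[]{f[5], "P:" + csvLine}; // key = owneruserid
    }

    static String[] mapComment(String csvLine) {
        String[] f = csvLine.split(",");
        return new String[]{f[1], "C:" + csvLine}; // key = userid
    }

    // Reduce phase: all tagged values for one key arrive together;
    // splitting them back out by tag completes the join.
    static Map<String, List<String>> reduceJoin(String key, List<String> values) {
        Map<String, List<String>> sides = new HashMap<>();
        sides.put("posts", new ArrayList<>());
        sides.put("comments", new ArrayList<>());
        for (String v : values) {
            if (v.startsWith("P:")) sides.get("posts").add(v.substring(2));
            else sides.get("comments").add(v.substring(2));
        }
        return sides;
    }

    public static void main(String[] args) {
        // Two posts and one comment by user 42, one comment by user 7.
        List<String[]> emitted = new ArrayList<>();
        emitted.add(mapPost("1,1,2015-01-01,10,100,42,Title A,2,3"));
        emitted.add(mapPost("2,1,2015-01-02,5,50,42,Title B,1,0"));
        emitted.add(mapComment("900,42"));
        emitted.add(mapComment("901,7"));

        // Simulate the shuffle: group tagged values by join key.
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String[] kv : emitted) {
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            Map<String, List<String>> joined = reduceJoin(e.getKey(), e.getValue());
            System.out.println(e.getKey() + " -> posts=" + joined.get("posts").size()
                    + ", comments=" + joined.get("comments").size());
        }
    }
}
```

In real Hadoop code the same shape appears as two `Mapper` classes (one per `MultipleInputs` path) that both emit `Text` keys and tagged `Text` values, plus one `Reducer` that splits each key's values by tag.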

Upvotes: 1
