Dilshad Abduwali

Reputation: 1458

How to process two files in Hadoop Mapreduce?

I have to process two related files in a Hadoop environment using MapReduce. The first file is a huge log file that records users' activity. The second file is relatively small and contains details about the users. Both are .txt files. The first file (the log file) has the format:

UserId | loginTime | LogoutTime | roomNum | MachineID

This file is huge (a couple of TB).

The second file (the user file, small, about 20 MB) is:

UserId | UserFname | UserLname | DOB | Address

I have to find out how frequently each user uses the lab machines, find the most frequent users, and list their names.

I know how to process one file if everything were in it. Since the user details are in the other file, it is becoming hard for me to process. I am new to MapReduce, so I am seeking your help and advice here. To me, the problem is similar to joining two tables in an RDBMS by a foreign key.

Upvotes: 0

Views: 128

Answers (1)

alekya reddy

Reputation: 934

You can use the distributed cache for the small file. The distributed cache copies the file to the local disk of every node that runs a map or reduce task, so each task can read it locally without going back to HDFS.

Add the file to the distributed cache as follows:

Configuration conf = new Configuration();
// Register the small user file (an HDFS path) before creating the Job
DistributedCache.addCacheFile(new URI("/user/xxxx/cacheFile/exmple.txt"), conf);
Job job = new Job(conf, "wordcount");

Then retrieve the file in the setup method of your mapper and use the data in your map or reduce method:

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Local filesystem paths of the cached files on this task node
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    // Read the user file into an in-memory map keyed by UserId, then
    // look up each log record's UserId in map() to perform the join.
}
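For illustration, here is a minimal sketch of the join logic itself, as plain Java without the Hadoop classes so it stands alone. The pipe-delimited field positions are taken from the formats in the question; `parseUsers` is a hypothetical helper name, not part of any Hadoop API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class UserJoin {
    // Parse "UserId | UserFname | UserLname | DOB | Address" lines
    // (the user-file format from the question) into a map
    // from user id to full name.
    public static Map<String, String> parseUsers(Iterable<String> lines) {
        Map<String, String> users = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|");
            if (f.length >= 3) {
                users.put(f[0].trim(), f[1].trim() + " " + f[2].trim());
            }
        }
        return users;
    }

    public static void main(String[] args) {
        // In a real job this would be built in setup() from the cached file.
        Map<String, String> users = parseUsers(Arrays.asList(
                "u1 | Alice | Smith | 1990-01-01 | Somewhere",
                "u2 | Bob | Jones | 1985-05-05 | Elsewhere"));

        // In map(), extract the UserId from each log record
        // ("UserId | loginTime | LogoutTime | roomNum | MachineID")
        // and emit (name, 1); a reducer then sums the counts per name.
        String logLine = "u1 | 09:00 | 10:00 | 101 | m7";
        String userId = logLine.split("\\|")[0].trim();
        System.out.println(users.get(userId)); // prints "Alice Smith"
    }
}
```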

Alternatively, you can use MultipleInputs to assign a different mapper to each file and join the records in the reducer (a reduce-side join): each mapper tags its output with the source file, and the reducer sees all tagged values for one UserId together.
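To make the reduce-side join idea concrete, here is a plain-Java simulation of the shuffle-and-reduce step (not the Hadoop API; the record formats and the "U"/"L" tags are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideJoinSketch {
    // Simulates a reduce-side join: the user-file "mapper" tags its records
    // with "U", the log-file "mapper" tags its records with "L". The shuffle
    // groups all values by UserId, and the "reducer" combines the user's name
    // with the count of that user's log records.
    public static Map<String, Integer> join(List<String> userLines,
                                            List<String> logLines) {
        Map<String, List<String>> shuffled = new HashMap<>();
        for (String u : userLines) {                       // user-file mapper
            String[] f = u.split("\\|");
            shuffled.computeIfAbsent(f[0].trim(), k -> new ArrayList<>())
                    .add("U\t" + f[1].trim() + " " + f[2].trim());
        }
        for (String l : logLines) {                        // log-file mapper
            String[] f = l.split("\\|");
            shuffled.computeIfAbsent(f[0].trim(), k -> new ArrayList<>())
                    .add("L");
        }
        Map<String, Integer> usageByName = new HashMap<>(); // reducer output
        for (List<String> values : shuffled.values()) {
            String name = null;
            int count = 0;
            for (String v : values) {
                if (v.startsWith("U\t")) name = v.substring(2);
                else count++;                               // one log record
            }
            if (name != null) usageByName.put(name, count);
        }
        return usageByName;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = join(
                Arrays.asList("u1 | Alice | Smith | 1990-01-01 | Somewhere"),
                Arrays.asList("u1 | 09:00 | 10:00 | 101 | m7",
                              "u1 | 11:00 | 12:00 | 102 | m3"));
        System.out.println(out); // prints {Alice Smith=2}
    }
}
```

The trade-off: the map-side (distributed cache) join avoids shuffling the huge log file's user details, but only works because the user file fits in memory; the reduce-side join handles two large inputs at the cost of shuffling everything.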

Upvotes: 1
