2Big2BeSmall

Reputation: 1378

Using data from a file as a HashMap in a Hadoop MapReduce job

I have a file with 10,000 rows (the "small file") of key,value pairs. Different keys in the small file can have the same value.

I have to do a word count on a different file (the "big file"), but I need to replace each key from the big file with its value from the small file in the mapper.

Only after that should it be counted in the reducer.

I would like to achieve this using a single MapReduce job, WITHOUT using Pig/Hive.

Could you help me and guide me on how to do it?

The small file will be on HDFS, and I'm not sure how other nodes would be able to read from it. I don't think it's even recommended, because the node with the small file would have to work really hard sending data to each map task.

Upvotes: 1

Views: 1644

Answers (1)

Vignesh I

Reputation: 2221

You could do a map-side join and then count the results on the reduce side. Place your small file in the distributed cache so that it is available on every node; the framework copies it to each node before the tasks run there, so the map tasks read a local copy instead of all pulling from one node. In your mapper, load all the key,value pairs into a Java HashMap in the setup method, then stream the big file through and do the join (a lookup in that HashMap) in the map method. This will yield something like this:

Small file (K,V)

Big file (K1,V1) 

Mapper output (V(key),V1(value))

Then do a count in the reducer based on V (or swap the key and value in the map output, whichever fits your need).
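To make the data flow concrete, here is a minimal sketch of the join-then-count logic in plain Java, with no Hadoop classes involved. The class and method names (MapSideJoinSketch, mapKey, joinAndCount) are made up for illustration, and skipping big-file keys that have no entry in the small file is an assumption; the question does not say what should happen to them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the map-side join followed by a count.
class MapSideJoinSketch {

    // "Map" step: replace a big-file key with its value from the small file.
    // Returns null when the key is unknown (assumed policy: the caller skips it).
    static String mapKey(Map<String, String> lookup, String bigFileKey) {
        return lookup.get(bigFileKey);
    }

    // "Reduce" step: count how often each replaced key occurs.
    static Map<String, Integer> joinAndCount(Map<String, String> lookup,
                                             List<String> bigFileKeys) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted for stable printing
        for (String key : bigFileKeys) {
            String replaced = mapKey(lookup, key);
            if (replaced != null) {
                counts.merge(replaced, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the small file: two keys share the same value.
        Map<String, String> lookup = new HashMap<>();
        lookup.put("apple", "fruit");
        lookup.put("pear", "fruit");
        lookup.put("carrot", "vegetable");

        // Stand-in for the keys streamed from the big file.
        List<String> bigFile = Arrays.asList("apple", "pear", "carrot", "apple", "unknown");

        System.out.println(joinAndCount(lookup, bigFile)); // prints {fruit=3, vegetable=1}
    }
}
```

In a real job the lookup would happen in map() with the mapper emitting (V, 1), and the summing would happen in the reducer; the loop above only collapses both steps into one process to show the result.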

How to read from a distributed cache:

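// Imports needed at the top of the mapper source file:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Inside your Mapper class.
// Note: DistributedCache is deprecated in Hadoop 2.x and later, where
// Job.addCacheFile(...) and context.getCacheFiles() are the replacements.
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    // Local paths of the files shipped to this node through the distributed cache
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (filelist == null)
    {
        return; // nothing was added to the cache
    }
    for (Path findlist : filelist)
    {
        // getName() already returns a String, so no toString()/trim() is needed
        if (findlist.getName().equals("mapmainfile.dat"))
        {
            fetchvalue(findlist, context);
        }
    }
}

public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException
{
    // try-with-resources closes the reader even if parsing fails
    try (BufferedReader buff = new BufferedReader(new FileReader(realfile.toString())))
    {
        // some operations with the file, e.g. read each line and put the key,value pair into a HashMap field
    }
}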
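The "some operations with the file" step could look roughly like the sketch below. It is only an illustration: the class name LookupLoaderSketch and the method load are hypothetical, it assumes one comma-separated key,value pair per line as described in the question, and it silently skips lines without a comma, which may or may not be what you want.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses "key,value" lines into an in-memory lookup table.
class LookupLoaderSketch {

    static Map<String, String> load(Reader source) {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // assumed policy: ignore blank or malformed lines
                }
                String key = line.substring(0, comma).trim();
                String value = line.substring(comma + 1).trim();
                lookup.put(key, value);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short
            throw new UncheckedIOException(e);
        }
        return lookup;
    }
}
```

Inside fetchvalue you would pass new FileReader(realfile.toString()) as the source and keep the returned map in a field of the mapper, so that map() can look up each key from the big file.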

How to add a file to distributed cache:

DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"),conf);

Upvotes: 4
