yura
yura

Reputation: 14645

How to run hadoop multithread way in single JVM?

I have 4 core desktop and want to use all my cores for local data processing with hadoop. (i.e. sometimes I have enough power to process data locally sometimes I submit same jobs to cluster).

By default hadoop local mode runs only one mapper and one reducer so my local jobs are really slow. I do not want to setup cluster on single machine first because of "painful" configuration and second I have to create jar each time. So perfect solution is to how run embedded Hadoop on a single machine

PS pseudo-distributed mode is bad option since it will create cluster with Single node, so I will get only one mapper and I have to spend some time on additional configuration.

Upvotes: 2

Views: 4255

Answers (3)

Victor
Victor

Reputation: 781

    Configuration conf = new Configuration();

    Job job = new Job(conf, "SolerRandomHit");

    job.setOutputKeyClass(Text.class);

    job.setOutputValueClass(IntWritable.class);


    job.setMapperClass(MultithreadedMapper.class);

Upvotes: 0

rystsov
rystsov

Reputation: 1928

You need to use MultithreadedMapRunner - just set up it in JobConf's setMapRunnerClass method and don't forget to set mapred.map.multithreadedrunner.threads to desirable concurrency level.

Also there is an another way, you should:

  • set MultithreadedMapper as your mapper class in Job-typed object
  • call MultithreadedMapper.setMapperClass with you actual mapper class
  • call MultithreadedMapper.setNumberOfThreads with desirable concurrency level

But be careful, your mapper class should be thread safe and it's setup and cleanup methods would be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipulOutput, unless you implement you own MultithreadedMapper inspired class.

Upvotes: 5

Joe K
Joe K

Reputation: 18424

Hadoop purposely does not run more than one task at the same time in one JVM for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode, and increase the max number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.

Upvotes: 1

Related Questions