Argho Chatterjee

Reputation: 599

Running Hadoop in Local Mode for Production

I have been working with Hadoop for quite some time now, and we all know that local mode is normally used for building scripts/jobs and testing them locally. But suppose some of our customers have small data sets while others have larger ones, and we do not want to write two pieces of business-logic code, one for local mode and one for distributed mode. How should we go about this?

One approach for production deployment would be to run the Hadoop/Pig/MR jobs in local mode for customers with small data sets and to provide a distributed setup for customers with larger data sets, as sketched below.
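For context, the business logic itself should not have to differ: the same MapReduce driver jar can be shipped to both kinds of customers, with the execution mode decided purely by configuration. A rough sketch (the property values and the namenode host below are only examples, and the job uses identity mapper/reducer just to stay self-contained):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ModeAgnosticDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The execution mode is decided entirely by configuration, not by code.
        // In a real deployment these properties live in core-site.xml / mapred-site.xml,
        // so the same jar ships to both small-data and large-data customers.
        //
        // Local mode (single JVM, local filesystem):
        //   mapreduce.framework.name = local
        //   fs.defaultFS             = file:///
        //
        // Pseudo-distributed or fully distributed mode (YARN + HDFS):
        //   mapreduce.framework.name = yarn
        //   fs.defaultFS             = hdfs://namenode:8020   (host is illustrative)

        Job job = Job.getInstance(conf, "mode-agnostic-job");
        job.setJarByClass(ModeAgnosticDriver.class);
        // Default (identity) mapper and reducer keep this sketch minimal.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```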

My question is: is shipping a local-mode Hadoop setup to production (because the data size is not very big) a good idea?

Or should pseudo-distributed mode be the choice for small data sets in production? I would appreciate some thoughts on the limitations of each approach (local mode and pseudo-distributed mode) and on any risks involved in deploying either to production. Kindly advise if anyone has encountered similar design challenges.

Thanks

Upvotes: 2

Views: 521

Answers (1)

Sergei Rodionov

Reputation: 4529

We ship some of our product editions in pseudo-distributed mode, and even in local mode in cases of extremely slow disks or a lack of CPU resources. These configurations are typically installed on virtual machines, so what we recommend to customers is scheduled VM backups. This takes care of recovery to some extent.

The important thing is to inform customers of the inherent trade-offs in performance and reliability, and at the same time to encourage them to think of the current configuration as the architecture to scale from going forward, should they be satisfied with the functionality and overall results at a smaller scale.

We have customers running in pseudo-distributed mode with one unscheduled downtime incident over a two-year period: a power outage at the hardware level. There was some data loss due to the ungraceful shutdown, but it was limited in scope.

One thing we've done for these installations is to schedule an automated major compaction in HBase that is triggered by cron during off-peak hours on a daily basis.
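How that compaction is triggered is up to you; one option is a small utility built on the HBase Java client that the cron job invokes (the table name below is only a placeholder, and connection settings are assumed to come from hbase-site.xml). Calling major_compact from the hbase shell inside the cron entry would work just as well.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class NightlyMajorCompaction {
    public static void main(String[] args) throws Exception {
        // Table name passed in by the cron job; "my_table" is only a placeholder default.
        String table = args.length > 0 ? args[0] : "my_table";

        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Requests a major compaction; HBase performs it asynchronously.
            admin.majorCompact(TableName.valueOf(table));
        }
    }
}
```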

Upvotes: 1
