user538578964

Reputation: 773

Best practices when running Hadoop MapReduce jobs/Hive scripts/Pig scripts etc

I am interested in understanding how ETL jobs like Hadoop MapReduce jobs/Spark jobs/Hive scripts/Pig scripts are usually deployed in an on-premises production/development environment.

Are they always deployed and run using an orchestrator like Apache Airflow or Apache Oozie?

I'm assuming these jobs are almost never run standalone and are always run through a scheduler, even if it is just a simple scheduled bash script. Is this accurate?

It would also be extremely helpful to get some reading material on this topic.

Upvotes: 1

Views: 240

Answers (1)

Ben Watson

Reputation: 5531

It completely depends, and you'll find that most environments use a combination of the two approaches. Anything in production is likely to be scheduled - Hadoop jobs are no different from any other kind of job, and people want their production environments to be automated and reliable.
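To make "scheduled" concrete, here is a minimal sketch of what that often looks like with Airflow: a DAG that submits a Spark job once a day through a BashOperator. The DAG id, schedule and spark-submit path are made-up placeholders, and in a real environment you would more likely use a dedicated Spark provider/operator and proper connections rather than a raw shell command.

    # daily_etl_dag.py - minimal sketch of a scheduled job, assuming Airflow 2.x.
    # The DAG id, schedule and spark-submit command below are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_spark_etl",          # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",        # run once per day
        catchup=False,
    ) as dag:
        # Submit the Spark job exactly as you would by hand on an edge node.
        run_etl = BashOperator(
            task_id="spark_submit_etl",
            bash_command="spark-submit --master yarn /opt/jobs/etl_job.py",
        )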

Having said that, I have worked at companies where someone is hired to manually shepherd a critical pipeline from start to finish.

Developers will still need ways to run jobs easily and manually during development, and in that context jobs are typically run standalone.
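For contrast, a standalone run during development is often just a script invoked directly from the command line, e.g. spark-submit etl_job.py on an edge node. A minimal PySpark sketch of such a job (the input/output paths and the "status" column are invented for illustration):

    # etl_job.py - minimal PySpark job a developer might run by hand with
    # `spark-submit etl_job.py` during development. Paths and the "status"
    # column are hypothetical.
    from pyspark.sql import SparkSession

    def main():
        spark = SparkSession.builder.appName("dev_etl_job").getOrCreate()

        # Read raw data, keep only the rows we care about, write the result.
        events = spark.read.csv("/data/raw/events.csv", header=True)
        clean = events.filter(events["status"] == "OK")
        clean.write.mode("overwrite").parquet("/data/clean/events")

        spark.stop()

    if __name__ == "__main__":
        main()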

As an aside, I'm not sure that there are many people still deploying new MapReduce, Pig and Oozie jobs these days. Oozie hasn't had a release since 2019, Pig since 2017, and there's almost no reason to run MapReduce instead of Spark.

Upvotes: 2
