dataproc cluster on google cloud

Question

My understanding is that running a Dataproc cluster instead of setting up your own compute engine cluster is that it takes care of installing the storage connector (and other connectors). What else does it do for you?

Patrick Clay · Accepted Answer

The most significant feature of Dataproc beyond a DIY cluster is the ability to submit Jobs (Hadoop & Spark jars, Hive queries etc.) via an API, WebUI and CLI without configuring tricky network firewalls and exposing YARN to the world.

Cloud Dataproc also takes care of a lot of configuration and initialization such as setting up A shared Hive Metastore for Hive and Spark. And allows specifying Hadoop, Spark, etc. properties at boot time.

It boots a cluster in ~90s, which in my experience is faster than most cluster setups. This allows you to tear down the cluster when you are not interested and not have to wait tens of minutes to bring a new one up.

I'd encourage you to look at a more comprehensive list of features.

dataproc cluster on google cloud

Answers (1)

Related Questions