What is the recommended way to upgrade a Dataproc cluster?

Question

Dataproc seems to be designed to be Stateless / Immutable. Is this assumption correct? Should we just quit right now if we are planning to deploy a Hive/Presto data warehouse?

We are struggling to find any documentation that explains how to care for a cluster once it has been provisioned:

How to upgrade components?
How to install tools (e.g. Hue) after a cluster has been created?
How to secure access to data + services once deployed?

The FAQ entry "Can I run a persistent cluster?" doesn't really address this either.

The internet suggests we should just create a new cluster if we have a problem. As a developer I'm quite happy with the "Minimize State" argument, but I work in the enterprise world, which likes solutions such as Hive (and its metadata store), Hue and Zeppelin, and wants to connect external tools like Tableau to a cluster.

The documentation should really make it clear which use cases Dataproc excels at (batch, on-demand and short-lived workloads) versus things it isn't really designed for (e.g. OLAP).

Dennis Huo · Accepted Answer

Dataproc indeed provides the most benefit for on-demand use cases, but this isn't necessarily at odds with being used for OLAP. The main idea is that the stateful components can all be separated from the "processing" resources so that you can better adjust resources according to needs at different points in time.

The recommended architecture for your Hive metadata is to keep the Hive metastore backend off the cluster, e.g. in a CloudSQL instance. Many are able to use Dataproc in this way with short-lived or semi-short-lived clusters (e.g. keeping a pool of live clusters but deleting and recreating the oldest each day or each week), combined with initialization actions that point the Hiveserver at CloudSQL: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/cloud-sql-proxy
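As a rough illustration of the setup described above, here is a minimal Python sketch that only assembles the `gcloud dataproc clusters create` command line for a cluster using the cloud-sql-proxy initialization action. It does not run anything. The function name `build_create_command` is hypothetical, and the bucket path, the `sql-admin` scope and the `hive-metastore-instance` metadata key reflect my reading of that initialization action's README; check them against the current README before relying on them.

```python
from typing import List


def build_create_command(
    cluster_name: str,
    region: str,
    image_version: str,
    metastore_instance: str,
) -> List[str]:
    """Build (but do not run) the argv for creating a Dataproc cluster
    whose Hive metastore backend lives in a CloudSQL instance.

    metastore_instance is the CloudSQL connection name, in the form
    PROJECT:REGION:INSTANCE.
    """
    # Assumption: the initialization actions are published in a regional
    # bucket with this naming scheme; verify against the repository README.
    init_action = (
        f"gs://goog-dataproc-initialization-actions-{region}"
        "/cloud-sql-proxy/cloud-sql-proxy.sh"
    )
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
        # The proxy needs permission to talk to the CloudSQL API.
        "--scopes=sql-admin",
        f"--initialization-actions={init_action}",
        # Assumption: metadata key as documented by the init action.
        f"--metadata=hive-metastore-instance={metastore_instance}",
    ]
```

Because the cluster definition is just a list of arguments, it can be kept in version control and passed to `subprocess.run` whenever a replacement cluster is needed, which is what makes short-lived clusters practical.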

In this world, the stateful metastore pieces are all in CloudSQL and bulk storage is all in GCS. Some clusters might sync from GCS to local HDFS for performance reasons (especially if running HDFS on local SSD), but even for interactive OLAP use cases this isn't usually necessary; running queries directly against GCS works fine too. There are admittedly some performance pitfalls for older formats due to the longer round-trip latency to GCS, but a bit of tuning can bring it mostly in line; here's a (non-Google-owned) blog post about Presto on Dataproc going over some of those.

This also provides much easier ways to handle traditional cluster admin: upgrades are just a matter of swapping out entire clusters, additional tools should be installed through initialization actions for easy reproducibility on new clusters, and you can more easily define security perimeters at a per-cluster granularity.
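To make the "swap out entire clusters" idea concrete, here is a small Python sketch of how a rotation job might decide which clusters in a pool to replace. It is pure planning logic under stated assumptions: the `Cluster` record, the `plan_rotation` function and the `warehouse-` name prefix are all hypothetical, and actually creating or deleting clusters is left to whatever tooling you already use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Cluster:
    """Hypothetical record of a live cluster in the pool."""
    name: str
    created: datetime


def plan_rotation(
    pool: List[Cluster], now: datetime, max_age: timedelta
) -> Tuple[List[str], List[str]]:
    """Return (names to delete, names to create).

    Every cluster at least max_age old is scheduled for deletion, oldest
    first, and one freshly named replacement is planned for each.
    """
    expired = sorted(
        (c for c in pool if now - c.created >= max_age),
        key=lambda c: c.created,
    )
    to_delete = [c.name for c in expired]
    # Hypothetical naming scheme: prefix, creation date, index.
    stamp = now.strftime("%Y%m%d")
    to_create = [f"warehouse-{stamp}-{i}" for i in range(len(expired))]
    return to_delete, to_create
```

In practice you would probably create the replacements first (with the newer image version and the same initialization actions), point clients at them, and only then delete the old clusters. Since the metastore and the data live outside the cluster, nothing is lost in the swap.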
