rish0097

Reputation: 1094

Hive authorization in Dataproc

Dataproc doesn't have built-in integration with Apache Ranger or Apache Sentry. So what is the recommended way of handling user authorization in Hive?

I'm a newbie at Dataproc, so your answers will really help.

Upvotes: 4

Views: 1144

Answers (2)

Milan Ilic

Reputation: 93

Hive authorization is disabled by default on Dataproc, so anyone can do anything.

However, there is a way to enable it. I found this property in /etc/hive/conf/hive-default.xml.template

  <property>
    <name>hive.security.authorization.enabled</name>
    <value>false</value>
    <description>enable or disable the Hive client authorization</description>
  </property>

By copying this part into /etc/hive/conf/hive-site.xml and setting the value to true, I managed to turn on Hive authorization. You just need to run systemctl restart hive-server2.service afterwards so Hive picks up the config change, and that's it.
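If you are creating the cluster yourself, a hedged alternative is to set the same property at cluster creation time via the hive: prefix of Dataproc's --properties flag, so it lands in hive-site.xml without manual edits. A minimal sketch (the cluster name and region below are placeholders):

  # Sketch: enable Hive authorization when the cluster is created
  # (my-cluster and us-central1 are placeholder values)
  gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='hive:hive.security.authorization.enabled=true'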

According to the Hive documentation, consider adding this property as well:

<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>the privileges automatically granted to the owner whenever a table gets created. 
   An example like "select,drop" will grant select and drop privilege to the owner of the table</description>
</property>
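Once authorization is on, privileges can be managed with HiveQL GRANT/REVOKE statements. A minimal sketch, assuming a hypothetical table sales_data and user analyst1, run through beeline against the default HiveServer2 endpoint:

  # Placeholder table and user names; adjust to your environment
  beeline -u jdbc:hive2://localhost:10000 \
    -e "GRANT SELECT ON TABLE sales_data TO USER analyst1;"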

There are also a few more authorization-related properties, but you can ignore them if you are using GCS as the underlying storage. According to GCP support, they won't matter because the GCS connector does not really support fine-grained HDFS permissions.

<name>hive.metastore.authorization.storage.checks</name>
<name>hive.metastore.authorization.storage.check.externaltable.drop</name>
...

Upvotes: 0

James

Reputation: 2331

This is a good question.

As some background, the overall goal for Cloud Dataproc (and other Cloud services) is to have security/IAM live at the individual product level. In a lot of cases, customers who use a lot of Hive eventually switch to BigQuery, which has specific controls.

On the cluster level, your cluster will run under a service account, and you can switch the service account used by your cluster. This means you can restrict a cluster's access to the things that service account has access to - GCS buckets, etc. This scopes that specific cluster to only access a specific set of resources.
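For example, a minimal sketch (the cluster name, region, and service account here are placeholders) of creating a cluster under a dedicated, narrowly scoped service account:

  # The cluster can only reach the resources this service account can reach
  gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --service-account=hive-cluster-sa@my-project.iam.gserviceaccount.com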

From the user level, you can gate access to Dataproc via the Dataproc IAM roles. But, as you note, when someone has access to a cluster, they can effectively utilize anything the cluster has access to.
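As a sketch (the project and user below are placeholders), granting a user just the Dataproc Editor role rather than broad project-level access:

  # Grants only the Dataproc Editor role; no wider project access is added here
  gcloud projects add-iam-policy-binding my-project \
    --member=user:analyst@example.com \
    --role=roles/dataproc.editor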

We usually see customers create a set of projects and service accounts to partition out their security needs. For example, a customer may create three projects: one for sales, one for marketing, and one for developers. Each of these projects and service accounts has its own permissions set and, therefore, its Cloud Dataproc use is inherently scoped.
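For example (the bucket, project, and service account names below are made up), the sales project's cluster service account might be granted read access only to its own team's bucket:

  # Grant the sales clusters' service account read-only access to the sales bucket
  gsutil iam ch \
    serviceAccount:sales-cluster-sa@sales-project.iam.gserviceaccount.com:objectViewer \
    gs://sales-team-bucket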

With that said, this has been an area of focus for longer-term improvement.

(Disclaimer - I am the Cloud Dataproc PM)

Upvotes: 6
