Reputation: 1000
We are currently in a setup where Data Exploration and Production jobs (which are run from a Production workflow) live in a single workspace (our "Data" workspace).
The Production jobs all refer to notebooks within a specific folder in the Databricks workspace, to which we have restricted access; notebooks within this folder are deployed using a CI/CD process. The Databricks jobs are also created from the same CI/CD pipeline. Basically, the jobs are described in JSON format, and the authentication info to connect to the data lake (where the data is stored) is part of the cluster config, which is created on the fly.
The permission settings for these jobs are also part of the same CI/CD process, which sets the permissions so that non-admin users have only "view" permissions.
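For illustration, a minimal sketch of what that permission step could look like against the Databricks Jobs Permissions API (the host, token, job id and group name below are placeholders, not our real values):

import requests

# Placeholders only - not our real workspace URL, token, job id or group name
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<ci-cd-service-principal-token>"
JOB_ID = 123

# Grant the non-admin group view-only access to the deployed job
response = requests.put(
    f"{DATABRICKS_HOST}/api/2.0/permissions/jobs/{JOB_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "access_control_list": [
            {"group_name": "data-scientists", "permission_level": "CAN_VIEW"}
        ]
    },
)
response.raise_for_status()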
Now, all that works great.
Now, non-admin users are "advised" to create jobs through the pipeline, but should they want to, they can very well create jobs themselves in an ad-hoc way, and there is no way to stop them from doing so. Creating the jobs themselves makes them the owner. As a result, they can potentially copy the same Spark config, which has "write" rights to the curated zone in the data lake, and that is a security threat. We have defined ACLs in our data lake so that non-admins have write access only to the "sandbox" filesystem.
But since they can view the Spark configs of the Production jobs (to which they have view access), they can very well copy those same configs into the cluster config of the ad-hoc jobs they could potentially create.
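For context, the sandbox ACL is set up roughly along these lines; this is only a sketch with a placeholder group object id, using the azure-storage-file-datalake SDK, not our actual provisioning code:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account name and group object id - not our real values
service = DataLakeServiceClient(
    account_url="https://datalakename.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Non-admins get rwx (plus a default entry so new children inherit it)
# only on the "sandbox" filesystem; the curated filesystem has no such entry.
sandbox_root = service.get_file_system_client("sandbox").get_directory_client("/")
sandbox_root.set_access_control(
    acl=(
        "user::rwx,group::r-x,other::---,"
        "group:<non-admin-group-object-id>:rwx,"
        "default:group:<non-admin-group-object-id>:rwx"
    )
)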
I have now decided to have a separate workspace for the Production jobs, just to have separation of duties. We had this earlier, but then MLflow came in; back then we could not share the MLflow Model Registry across workspaces, but now we can, which is great.
But the problem is still that the same users would need access to this new workspace, because we want them to monitor the jobs themselves. They would also need access to get the "job_id" of the jobs deployed by the CI/CD pipeline, so that they can include it in the Airflow pipeline (from which we orchestrate job pipelines).
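For context, the job_id ends up in Airflow roughly like this; a minimal sketch with placeholder DAG, connection and job id values, not our actual DAG:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Placeholder DAG name, connection id and job_id - not our real pipeline
with DAG(
    dag_id="client_state_vector_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_client_state_vector = DatabricksRunNowOperator(
        task_id="run_client_state_vector",
        databricks_conn_id="databricks_prod",
        job_id=123,  # the job_id surfaced by the CI/CD pipeline after deployment
    )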
So, we are basically back to kind of "square one" (though I would still want a separate workspace for the Production jobs).
I have seen this idea but it apparently does not have enough votes (I have still upvoted it): here
Just to give an example and more clarity on how jobs and their Spark configs are defined in our repo and rolled out through our CI/CD process (and which non-admin users could potentially copy when creating ad-hoc jobs):
{
  "name": "ClientStateVector",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F32s_v2",
    "driver_node_type_id": "Standard_DS4_v2",
    "num_workers": 3,
    "spark_conf": {
      "spark.hadoop.fs.azure.account.oauth2.client.endpoint.datalakename.dfs.core.windows.net": "https://login.microsoftonline.com/tenantid/oauth2/token",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.hadoop.fs.azure.account.auth.type.datalakename.dfs.core.windows.net": "OAuth",
      "spark.hadoop.fs.azure.account.oauth.provider.type.datalakename.dfs.core.windows.net": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "spark.hadoop.fs.azure.account.oauth2.client.secret.datalakename.dfs.core.windows.net": "{{secrets/DatalakeKeySec/clientSecret}}",
      "spark.hadoop.fs.azure.account.oauth2.client.id.datalakename.dfs.core.windows.net": "{{secrets/DatalakeKeySec/clientID}}"
    }
  },
  "libraries": [
    { "whl": "dbfs:/artifacts/client-state-vector/1.0.47/client_state_vector-1.0.0-py3-none-any.whl" }
  ],
  "notebook_task": {
    "notebook_path": "/JobNotebooks/DataScienceNotebooks/ClientStateVector/bootstrap"
  }
}
What is the best way to give non-admins view rights while still ensuring that they are not able to create jobs with the same Spark configs, or not able to create jobs at all?
I understand that webhooks could be a possibility, but surely there has to be something simpler. Or am I missing/unaware of something?
Upvotes: 0
Views: 1141
Reputation: 1000
OK, over these past few days I found out that a job always runs with the creator/owner's credentials. So even if someone copies the configs over, if they don't have access to the referenced secret scopes, the jobs they create would fail. They would fail even if triggered by an admin or with an admin token.
So, although restrictions on creating jobs would still be great, if the above holds true, inadvertent access can be avoided.
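To make that explicit, the secret scope ACL can be locked down so that only the deployment principal can read it; a minimal sketch against the Secrets API, with placeholder host, token and principal values:

import requests

# Placeholders only - not our real workspace URL, token or principal
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<admin-token>"

# Only the CI/CD service principal (the job owner) gets READ on the scope;
# non-admin users have no ACL entry, so jobs they create that reference
# {{secrets/DatalakeKeySec/...}} would fail for them, as noted above.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/secrets/acls/put",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "scope": "DatalakeKeySec",
        "principal": "<deployment-service-principal-application-id>",
        "permission": "READ",
    },
)
response.raise_for_status()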
Upvotes: 1