Reputation: 285
We work on Jupyter notebooks interchangeably between Dataproc and a local computer. Usually, we write and test the code on a smaller sample locally and then run it on all of the data in Dataproc. However, we currently do this by downloading and uploading the notebooks between Google Cloud Storage (GCS) and the local computer, which is not optimal for several reasons. We also have a GitHub repository connected to the folder on the local computer. Is it possible to clone a GitHub repository to GCS and work with git from there?
We found a workaround using initialization actions when we create a cluster:
gcloud dataproc clusters create test-init-actions \
--enable-component-gateway \
--bucket {bucket-name} \
--single-node \
--image-version=2.1.0-RC2-debian11 \
--optional-components JUPYTER \
--project {project-name} \
--initialization-actions=gs://{project-name}/initialization-actions/clone-public-repo.sh
where the content of clone-public-repo.sh is just the following (we'll extend this to a private repository):
git clone https://github.com/{user}/{repo-name}
This clones the repository to the cluster's local storage and we can use git normally from there. The problem with this approach is that the local changes to the notebooks in Dataproc are not persisted if the cluster is deleted. So we'd always have to commit and push before deleting the cluster. This would result in committing unfinished code just for the sake of persistence and possible loss of progress if a developer forgets to commit and push.
Is there a way to persist the local changes on the cluster's disk somewhere else, for example in GCS, without manually storing the files there?
Edit: We would like to use git in both directions: cloning to the cluster and committing the changed code back to the GitHub repository. Cloning the repository into the cluster's local storage allows that, but we run the risk of losing uncommitted changes if the cluster is deleted. On the other hand, if we mirror the repository on GCS (using, for example, GitHub Actions) and work with the notebooks from there, the changes will be persisted on GCS, but we would be unable to commit and push them back to the GitHub repository.
Upvotes: 2
Views: 727
Reputation: 1485
It is possible to enable GitHub Actions on Cloud Storage using the steps mentioned in this documentation. For your requirement to save the changes in Cloud Storage, ensure that the notebook is able to save its checkpoints there.
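As a rough sketch of one way to keep uncommitted work around, assuming gsutil is available on the cluster (it normally is on Dataproc images) and that you pick your own clone directory and backup location, a small helper could copy the cloned working tree to Cloud Storage with gsutil rsync. The function name, the repository path and the gs:// path below are hypothetical placeholders, not something taken from your setup:

```shell
#!/usr/bin/env bash
# Sketch only: copy a cloned repository's working tree to Cloud Storage.
# The function name and both arguments are hypothetical placeholders.

# Mirror a local directory to a gs:// location with gsutil rsync.
#   -m  run the transfer in parallel
#   -r  recurse into subdirectories
# Without -d, files that exist only in the destination are left in place.
sync_repo_to_gcs() {
  local repo_dir="$1"    # e.g. the directory created by git clone
  local backup_uri="$2"  # e.g. gs://{bucket-name}/some/backup/path

  if [ -z "$repo_dir" ] || [ -z "$backup_uri" ]; then
    echo "usage: sync_repo_to_gcs <repo-dir> <gs://bucket/path>" >&2
    return 2
  fi

  gsutil -m rsync -r "$repo_dir" "$backup_uri"
}
```

You could call it by hand before deleting the cluster, or from a cron entry, for example `sync_repo_to_gcs /path/to/{repo-name} gs://{bucket-name}/notebook-backup`. Swapping the two arguments in an initialization action would restore the files on a new cluster. Since the .git directory is copied as well, committing and pushing should still work from the restored copy, though I have not verified this on Dataproc.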
Upvotes: 1