Brendan Martin

Reputation: 639

How to put a dataset on a gcloud kubernetes cluster?

I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster.

I originally tried to just run Dask with the dataset loaded into my local RAM, but that sends the data over the network for every task, and the cluster only reaches about 2% utilization when performing the work.

Is there a way to put the dataset onto the Kubernetes cluster so I can get 100% CPU utilization?

Upvotes: 3

Views: 320

Answers (1)

MRocklin

Reputation: 57251

Many people store data on a cloud object store, such as Amazon's S3 or Google Cloud Storage.
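
As a minimal sketch of that first step (assuming you have gcsfs installed and credentials already configured; the project, bucket, and file names below are placeholders), you could upload a local file to Google Cloud Storage before reading it from the cluster:

import gcsfs

# Connect to GCS; 'my-project' and 'my-bucket' are placeholders for your own project and bucket
fs = gcsfs.GCSFileSystem(project='my-project')

# Copy a local CSV up to the bucket so the Dask workers can read it directly
fs.put('2018-01-01.csv', 'my-bucket/2018-01-01.csv')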

For Dask in particular, these data stores are supported in most of the data ingestion functions by using a protocol prefix like the following:

import dask.dataframe as dd
df = dd.read_csv('gcs://bucket/2018-*-*.csv')

You will also need to have the relevant Python library installed to access this cloud storage (gcsfs in this case). See http://dask.pydata.org/en/latest/remote-data-services.html#known-storage-implementations for more information.
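
If credentials aren't picked up automatically, gcsfs settings can usually be passed through storage_options; this is a sketch, and the service-account path below is a placeholder for your own key file:

import dask.dataframe as dd

# Pass gcsfs credentials explicitly; the token path is a placeholder
df = dd.read_csv(
    'gcs://bucket/2018-*-*.csv',
    storage_options={'token': '/path/to/service-account.json'},
)

Note that the worker pods on the Kubernetes cluster typically need gcsfs installed as well, since they are the ones actually reading the files.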

Upvotes: 1
