nanojack

Reputation: 31

Save data from Dataproc to Datastore

I have implemented a recommendation engine in Python 2.7 on Google Dataproc/Spark, and need to store the output as records in Datastore for subsequent use by App Engine APIs. However, there doesn't seem to be a way to do this directly.

There is no Python Datastore connector for Dataproc as far as I can see. The Python Dataflow SDK doesn't support writing to Datastore (although the Java one does). MapReduce doesn't have an output writer for Datastore.

That doesn't appear to leave many options. At the moment I think I will have to write the records to Google Cloud Storage and have a separate task running in App Engine to harvest them and store them in Datastore. That is not ideal: aligning the two processes has its own difficulties.

Is there a better way to get the data from Dataproc into Datastore?

Upvotes: 0

Views: 865

Answers (2)

nanojack

Reputation: 31

I succeeded in saving Datastore records from Dataproc. This involved installing additional components on the master VM (ssh to it from the console).

The App Engine SDK is installed and initialised using:

sudo apt-get install google-cloud-sdk-app-engine-python
sudo gcloud init

This places a new google directory under /usr/lib/google-cloud-sdk/platform/google_appengine/.

The Datastore library is then installed via:

sudo apt-get install python-dev
sudo apt-get install python-pip
sudo pip install -t /usr/lib/google-cloud-sdk/platform/google_appengine/ google-cloud-datastore

For reasons I have yet to understand, this actually installed one level deeper, i.e. in /usr/lib/google-cloud-sdk/platform/google_appengine/google/google, so for my purposes it was necessary to manually move the components up one level in the path.

To enable the interpreter to find this code I had to add /usr/lib/google-cloud-sdk/platform/google_appengine/ to the Python path. The usual bash tricks (exporting PYTHONPATH in a profile) weren't being sustained across the job, so I ended up doing this at the start of my recommendation engine.
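A minimal sketch of that path fix, using the SDK directory from the steps above; it has to run before any google.cloud imports:

```python
import sys

# Prepend the App Engine platform directory so the datastore client
# installed there is importable from the Spark driver.
SDK_PATH = "/usr/lib/google-cloud-sdk/platform/google_appengine/"
if SDK_PATH not in sys.path:
    sys.path.insert(0, SDK_PATH)

# Only after the path fix can the library be imported, e.g.:
# from google.cloud import datastore
```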

Because of the large amount of data to be stored, I also spent a lot of time attempting to save it via MapReduce. Ultimately I concluded that too many of the required services were missing on Dataproc. Instead I am using a multiprocessing pool, which achieves acceptable performance.
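A sketch of the multiprocessing approach, under a few assumptions: records are plain dicts with an 'id' field, the kind name 'Recommendation' is hypothetical, and each worker builds its own Datastore client (a client should not be shared across forked processes). Datastore's put_multi accepts at most 500 entities per call, hence the batching:

```python
from multiprocessing import Pool

BATCH_SIZE = 500  # Datastore allows at most 500 entities per put_multi call


def chunk(records, size=BATCH_SIZE):
    """Split the record list into batches small enough for put_multi."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def save_batch(batch):
    """Write one batch of records; runs in a worker process."""
    # Import and construct the client inside the worker so each
    # process gets its own connection after the fork.
    from google.cloud import datastore
    client = datastore.Client()
    entities = []
    for record in batch:
        key = client.key('Recommendation', record['id'])  # hypothetical kind
        entity = datastore.Entity(key=key)
        entity.update(record)
        entities.append(entity)
    client.put_multi(entities)
    return len(entities)


def save_all(records, processes=8):
    """Fan the batches out across a pool; returns the count saved."""
    pool = Pool(processes)
    try:
        return sum(pool.map(save_batch, chunk(records)))
    finally:
        pool.close()
        pool.join()
```

The pool size is something to tune: each worker holds its own Datastore connection, so more processes trade memory and connection overhead for write throughput.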

Upvotes: 3

James

Reputation: 2331

In the past, the Cloud Dataproc team maintained a Datastore connector for Hadoop, but it was deprecated for a number of reasons. At present, there are no formal plans to resume its development.

The page mentioned above lists a few options, and your approach is one of the solutions mentioned. At this point, I think your setup is probably one of the easiest if you're committed to Cloud Datastore.

Upvotes: 1
