Dung Nguyen

Reputation: 31

Using Python and Google Cloud Engine to process big data

I am new to Python programming and I need help. I have 10GB of data, and I have written Python code in Spyder to process it; part of the code is provided below. The code works fine on a small sample of the data, but my laptop cannot handle the full 10GB, so I need to use Google Cloud Engine. How can I upload the data and use Google Cloud Engine to run my code?

import os
import pandas as pd
import pickle
import glob
import numpy as np

# Load the full dataset from a pickle file.
df = pd.read_pickle(r'C:\user\mydata.pkl')

# For each year from 2018 down to 1995, keep only the rows whose
# OverlapYearStart is at most that year and save the result.
# df is reassigned each pass, so the filter tightens cumulatively.
for i in range(2018, 1994, -1):
    df = df[df.OverlapYearStart <= i]
    df.to_pickle(r'C:\user\done\{}.pkl'.format(i))

Upvotes: 3

Views: 543

Answers (2)

Gabe Weiss

Reputation: 3342

Probably the easiest thing to start digging into is App Engine, to run the code itself:

https://cloud.google.com/appengine/docs/python/

And use Google Cloud Storage to hold your data objects:

https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
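For getting the 10GB up there in the first place, a minimal sketch using the google-cloud-storage Python client might look like this (the bucket name and object paths are placeholders, not anything from your setup):

from google.cloud import storage

client = storage.Client()            # picks up your default GCP credentials
bucket = client.bucket("my-bucket")  # hypothetical bucket name

# Upload the local pickle to Cloud Storage.
bucket.blob("mydata.pkl").upload_from_filename(r"C:\user\mydata.pkl")

# Later, on the machine running your code, pull it back down.
bucket.blob("mydata.pkl").download_to_filename("/tmp/mydata.pkl")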

I don't know what the output of your application is, so depending on what you want to do with it, Google Compute Engine may be the right answer if App Engine doesn't quite fit what you're doing.

https://cloud.google.com/compute/

The first two links take you to the documentation on how to get going with Python on App Engine and Google Cloud Storage.

Edit, to add from the comments: you'll also need to manage the memory footprint of your app. If you're really doing everything in one giant while loop, you'll have memory problems no matter where you run the application, since all 10GB of your data will likely get loaded into memory at once. I'd definitely still shift it into the Cloud, but that data will need to get broken up somehow and handled in smaller chunks.
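To make the chunking concrete, here's a rough sketch of one way to do it, assuming the pickle were first re-saved as CSV (the file and directory names are placeholders); pandas can then stream the file instead of loading all of it at once:

import os
import pandas as pd

# Stream the data in 1-million-row chunks rather than loading all 10GB,
# appending each year's filtered rows to a per-year CSV so that no single
# year's result has to fit in memory either.
os.makedirs("done", exist_ok=True)
for chunk in pd.read_csv("mydata.csv", chunksize=1_000_000):
    for i in range(2018, 1994, -1):
        out = "done/{}.csv".format(i)
        chunk[chunk.OverlapYearStart <= i].to_csv(
            out, mode="a", header=not os.path.exists(out), index=False)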

Upvotes: 2

Enrique Zetina

Reputation: 835

I agree with the previous answer; just to complement it, you can take a look at AI Platform Notebooks, a managed service that offers an integrated JupyterLab environment. It can also pull your data from BigQuery and allows you to scale your application on demand.

On the other hand, I don't know how you have stored your 10GB of data: in CSV files? In a database? As mentioned in the first answer, Cloud Storage allows you to create buckets to store your data. Once the data is in Cloud Storage, you can export it into BigQuery tables and work with it in your app using Google App Engine or, per the earlier suggestion, AI Platform Notebooks; which one fits best will depend on your solution.
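As a rough sketch of that last step, assuming the data already sits in Cloud Storage as a CSV (the bucket, dataset, and table names below are placeholders), the BigQuery Python client can load it like this:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# Load gs://my-bucket/mydata.csv into the table my-project.my_dataset.mydata.
load_job = client.load_table_from_uri(
    "gs://my-bucket/mydata.csv",
    "my-project.my_dataset.mydata",
    job_config=job_config,
)
load_job.result()  # block until the load job completes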

Upvotes: 3
