LeonBrain

Reputation: 347

How to create a separate Python script for uploading data into ndb

Can anyone point me in the right direction as to where I should place a script solely for loading data into ndb? I wish to upload all the data into GAE ndb so that the application can run queries on it.

Right now, the loading of data is done in my application. I wish to place it separately from the main application.

Should it be configured in the app.yaml file?

EDITED

This is a snippet of the entity and the handler that upload the data into GAE ndb. I wish to place this chunk of code separately from my main application .py, since the upload won't be done frequently and I'd like to keep the code in the main application cleaner.

import webapp2
from google.appengine.ext import ndb

class TagTrend_refine(ndb.Model):
    tag = ndb.StringProperty()
    trendData = ndb.BlobProperty(compressed=True)

class MigrateData(webapp2.RequestHandler):
    def get(self):
        listOfEntities = []
        f = open("tagTrend_refine.txt")
        lines = f.readlines()
        f.close()
        for line in lines:
            temp = line.strip().split("\t")
            data = TagTrend_refine(
                tag = temp[0],
                trendData = temp[1]
            )
            listOfEntities.append(data)
        ndb.put_multi(listOfEntities)

For example, if I placed the above code in a file called dataLoader.py, where and how should I invoke it?

In app.yaml, alongside my main application (knowledgeGraph.application)?

- url: /.*
  script: knowledgeGraph.application

Upvotes: 1

Views: 170

Answers (2)

Jeffrey Rennie

Reputation: 3443

Alex's solution will work, as long as all your data can be loaded in under 1 minute, since that's the timeout for an App Engine request.

For larger data, consider calling the datastore API directly from your own computer where you have the source. It's a bit of a hassle because it's a different API; it's not ndb. But it's still a pretty simple API. Here's some code that calls the API: https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/2-structured-data/bookshelf/model_datastore.py

Again, this code can run anywhere. It doesn't need to be uploaded to app engine to run.
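
To make that concrete, here is a minimal sketch of that approach, assuming the google-cloud-datastore client library is installed and authenticated on your local machine, the same TagTrend_refine kind and tab-separated file as in the question, and a placeholder project ID (the file name local_loader.py is just illustrative):

# local_loader.py -- sketch: load the file from your own machine with the
# Cloud Datastore client library (the API used in the linked sample), not ndb.
from google.cloud import datastore

# The project ID and file path are placeholders -- substitute your own.
client = datastore.Client(project='your-project-id')

entities = []
with open('tagTrend_refine.txt') as f:
    for line in f:
        tag, trend_data = line.strip().split('\t')
        entity = datastore.Entity(key=client.key('TagTrend_refine'))
        entity.update({'tag': tag, 'trendData': trend_data})
        entities.append(entity)

# put_multi is limited to 500 entities per call, so write in batches.
for i in range(0, len(entities), 500):
    client.put_multi(entities[i:i + 500])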

Upvotes: 0

Alex Martelli

Reputation: 882641

You don't show us the application object (no doubt a WSGI app) in your knowledgeGraph.py module, so I can't know what URL you want to serve with the MigrateData handler -- I'll just guess it's something like /migratedata.

So the class TagTrend_refine should be in a separate file (usually called models.py) so that both your dataloader.py and your knowledgeGraph.py can import models to access it (models.py will need its own import of ndb, of course). Access to the entity class will then be as models.TagTrend_refine -- very basic Python.
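
As a sketch, models.py would contain little more than the entity class from the question plus its own ndb import:

# models.py -- shared entity definition, importable by both
# knowledgeGraph.py and dataloader.py
from google.appengine.ext import ndb

class TagTrend_refine(ndb.Model):
    tag = ndb.StringProperty()
    trendData = ndb.BlobProperty(compressed=True)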

Next, you'll complete dataloader.py by defining a WSGI app, e.g., at the end of the file:

app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])

(of course this means this module will need to import webapp2 as well -- can I take for granted a knowledge of super-elementary Python?).
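
Putting those pieces together, dataloader.py might then look roughly like this (the handler body follows the question's code, with the entity class reached through models):

# dataloader.py -- handler plus WSGI app, kept out of the main application
import webapp2
from google.appengine.ext import ndb

import models

class MigrateData(webapp2.RequestHandler):
    def get(self):
        listOfEntities = []
        with open("tagTrend_refine.txt") as f:
            for line in f:
                temp = line.strip().split("\t")
                listOfEntities.append(models.TagTrend_refine(
                    tag=temp[0], trendData=temp[1]))
        ndb.put_multi(listOfEntities)

app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])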

In app.yaml, as the first URL, before that /.*, you'll have:

- url: /migratedata
  script: dataloader.app
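
So the handlers section as a whole, with /migratedata listed before the catch-all, would read:

handlers:
- url: /migratedata
  script: dataloader.app
- url: /.*
  script: knowledgeGraph.application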

Given all this, when you visit /migratedata, your handler will read the tagTrend_refine.txt file that you uploaded together with your .py, .yaml, and other files in your overall GAE application, and unconditionally create one entity per line of that file (assuming the indentation problems in the code as originally posted are fixed -- but, again, that's just super-elementary Python; presumably you used both tabs and spaces and they showed up fine in your editor, but not here on SO. I recommend you use only spaces, never tabs, in Python code).

However, this does seem to be a peculiar task. If /migratedata gets visited twice, it will create duplicates of all entities. If you change tagTrend_refine.txt and deploy a changed variation, then visit /migratedata... all the old entities will stick around and all the new entities will join them. And so forth.

Moreover -- /migratedata is NOT idempotent (if visited more than once it does not produce the same state as running it just once) so it shouldn't be a GET (and now we're on to super-elementary HTTP for a change!-) -- it should be a POST.
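
One way to address both points -- this is just a sketch of the idea, not something from the question -- is to key each entity on its tag and handle POST instead of GET, so that re-running the load overwrites existing entities rather than duplicating them:

# dataloader.py (revised sketch): POST handler, entities keyed on the tag
import webapp2
from google.appengine.ext import ndb

import models

class MigrateData(webapp2.RequestHandler):
    def post(self):  # POST, because the handler mutates the datastore
        entities = []
        with open("tagTrend_refine.txt") as f:
            for line in f:
                tag, trend_data = line.strip().split("\t")
                # Using the tag as the key id makes a re-run overwrite the
                # existing entity instead of creating a duplicate.
                entities.append(models.TagTrend_refine(
                    id=tag, tag=tag, trendData=trend_data))
        ndb.put_multi(entities)

A POST can be triggered with curl -X POST (or a tiny HTML form), since an ordinary browser visit issues a GET.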

In fact I suspect (though I'm really flying blind here, since you've given such a tiny amount of information) that what you actually want is to upload a .txt file to a POST handler and do the updates that way (perhaps avoiding duplicates...?). However, I'm no mind reader, so this is about as far as I can go.

I believe I have fully answered the question you posted (though perhaps not the one you meant but didn't express:-), and by SO's etiquette it would be nice to upvote and accept this answer and then, if needed, post another question expressing MUCH more clearly and completely what you're trying to achieve: your current .py and .yaml (ideally with correct indentation), what they actually do, and why you'd like to do something different. For POST vs GET in particular, just study When should I use GET or POST method? What's the difference between them? ...

Upvotes: 1
