David Freeman

Reputation: 125

Importing data into Google App Engine

Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are ndb.Model types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes and then written (.put()).

To do this, I uploaded a .csv file to Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and failed after 41,000 records with an out-of-memory error. Splitting the .csv file into two sets of 24,000 records each fixed the problem.
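In outline, the task does something like the sketch below (the model names ImportRecord, Customer and Product, and the column layout, are simplified placeholders, not the real code):

import csv

import cloudstorage  # GoogleAppEngineCloudStorageClient
from google.appengine.ext import ndb

class ImportRecord(ndb.Model):
    customer_key = ndb.KeyProperty()
    product_key = ndb.KeyProperty()
    quantity = ndb.IntegerProperty()

def import_csv(gcs_filename):
    # Read the uploaded .csv from Cloud Storage and write one entity per row.
    fh = cloudstorage.open(gcs_filename)
    try:
        # iter(fh.readline, '') feeds the reader line by line instead of
        # loading the whole file into memory
        for row in csv.reader(iter(fh.readline, '')):
            customer = ndb.Key('Customer', row[0]).get()  # integrity check 1
            product = ndb.Key('Product', row[1]).get()    # integrity check 2
            if customer is None or product is None:
                continue  # skip rows that fail the checks
            ImportRecord(customer_key=customer.key,
                         product_key=product.key,
                         quantity=int(row[2])).put()
    finally:
        fh.close()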

So, my questions are:

a) is this the best way to do this?

b) is there a faster way (the next upload might be around 400,000 records)? and

c) how do I get over (or stop) the out of memory error?

Many thanks, David

Upvotes: 1

Views: 90

Answers (2)

snakecharmerb

Reputation: 55924

The ndb in-context cache could be contributing to the memory errors. From the docs:

With executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.

You can prevent caching on a case-by-case basis by setting a context option in your ndb calls, for example:

foo.put(use_cache=False)

Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.

if some_condition:
    context = ndb.get_context()
    context.clear_cache()
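In a long-running import loop like yours, that flush could be keyed off a row counter, for example (FLUSH_EVERY and process_row here are just placeholders for your own batch size and per-row logic):

FLUSH_EVERY = 1000

for i, row in enumerate(rows):
    process_row(row)  # your existing checks and .put()
    if i and i % FLUSH_EVERY == 0:
        # drop everything cached so far to keep memory flat
        ndb.get_context().clear_cache()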

Upvotes: 2

GAEfan

Reputation: 11370

1) Have you thought about (even temporarily) upgrading your server instances?

https://cloud.google.com/appengine/docs/standard/#instance_classes

2) I don't think a 41,000-row CSV is enough to run out of memory, so you probably need to change your processing:

a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.

b) Experiment with ndb.put_multi() to batch the writes (see the sketch below, which also covers (a))
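Something along these lines, for example (untested; build_entity() is a placeholder for your integrity checks and model construction, the chunk and batch sizes are arbitrary, and it assumes the deferred library is enabled):

import csv

import cloudstorage
from google.appengine.ext import deferred, ndb

CHUNK_SIZE = 2000   # rows handled per task
BATCH_SIZE = 500    # entities per put_multi() call

def process_chunk(gcs_filename, start_row=0):
    fh = cloudstorage.open(gcs_filename)
    try:
        batch = []
        for i, row in enumerate(csv.reader(iter(fh.readline, ''))):
            if i < start_row:
                continue  # rows already handled by earlier tasks (simple, but re-reads the file)
            if i >= start_row + CHUNK_SIZE:
                # hand the rest of the file to a fresh task
                deferred.defer(process_chunk, gcs_filename, start_row=i)
                break
            entity = build_entity(row)  # your checks + model construction
            if entity is not None:
                batch.append(entity)
            if len(batch) >= BATCH_SIZE:
                ndb.put_multi(batch)  # one RPC for the whole batch
                batch = []
        if batch:
            ndb.put_multi(batch)
    finally:
        fh.close()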

Sharing some code from your loop and your puts might help.

Upvotes: 3
