Gustaf Rosenblad

Reputation: 1922

Deleting massive numbers of entities from Google App Engine NDB

The previous developers created some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, but it would be better if we could clean up all these values.

Here is the ndb.Model:

class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)

    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))

Somehow, we need to delete all Days where is_hole is true.

There are around 4 000 000 entities on the server, of which around 2 000 000 should be deleted.

Code so far

I thought it would be good to first count how many entities we should delete using this code:

count = Day.query(Day.is_hole != False).count(10000)

This (with the limit of 10 000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
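One way to count past the deadline is to page through the query with a cursor, accumulating the count across requests. The paging logic itself is independent of NDB; a minimal sketch, where the hypothetical `fetch_page` callback stands in for NDB's `Query.fetch_page` (which returns a results/cursor/more triple):

```python
def count_in_pages(fetch_page, page_size=1000):
    """Count query results one page at a time, feeding the cursor
    returned by each page into the next fetch."""
    total, cursor, more = 0, None, True
    while more:
        results, cursor, more = fetch_page(page_size, cursor)
        total += len(results)
    return total
```

In a real handler, each page (or the cursor) would be handed off to a follow-up task so that no single request counts everything.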

For deleting, I've tried this code:

ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])

This (with the limit) takes around 30 seconds.
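The fetch-and-delete cycle above can be repeated in fixed-size batches until the query is exhausted, which is the core of most answers to this problem. The batching loop is independent of NDB; a minimal sketch, where the hypothetical `fetch_keys` and `delete_batch` callbacks stand in for `Query.fetch(..., keys_only=True)` and `ndb.delete_multi`:

```python
def delete_in_batches(fetch_keys, delete_batch, batch_size=500):
    """Repeatedly fetch up to batch_size keys and delete them,
    stopping once the query returns no more results."""
    deleted = 0
    while True:
        keys = fetch_keys(batch_size)
        if not keys:
            break  # query exhausted: nothing left to delete
        delete_batch(keys)
        deleted += len(keys)
    return deleted
```

On App Engine, each iteration would typically run in its own task-queue task so that no single request hits the deadline.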

Question

How can I delete all Day entities where is_hole != False faster?

(We are using Python)

Upvotes: 0

Views: 230

Answers (2)

Sriram

Reputation: 9494

The best way is to use MapReduce, which runs in the task queue, and you can also use sharding to parallelize the work. Here is the Python code. Let me know if you need any clarification.

main.py

import webapp2

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from google.appengine.api import app_identity

def deleteEntity(entity):
    yield op.db.Delete(entity)

class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = (app_identity.get_default_gcs_bucket_name())
        yield mapreduce_pipeline.MapPipeline(
                "job_name", 
                "main.deleteEntity", 
                "mapreduce.input_readers.DatastoreInputReader", 
                params={
                    "entity_kind": 'models.Day',
                    "filters": [("is_hole", "=", True)],
                    "bucket_name": bucket_name
                },
                shards=5)

class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()

application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)

Upvotes: 1

Chameleon

Reputation: 10128

No, there is no faster way to delete entities - the request deadline is fixed.

But there are some tricks.

  1. You can get a longer deadline by using the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in the queue and have each task enqueue the next one after it finishes (recurrence).
  2. Another option, similar to the task queue, is to redirect back to the same handler after deleting some of the bad records, repeating until the last record is deleted. This requires keeping the browser open until the end.

if at_least_one_bad_record:
    delete_some_records()  # not longer than 30 s
    spawn_this_task_again()  # or redirect to this handler; the next call gets a fresh 30 s

Note that there is an exit point once no bad records remain, so it will delete all matching records without clicking again.
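The recurrence in option 1 can be modeled as a task that re-enqueues itself only while work remains. A pure-Python simulation of that control flow (the hypothetical `delete_some` callback deletes one batch and returns True if bad records remain; on App Engine the re-enqueue would be a real task-queue call):

```python
from collections import deque

def run_chained_tasks(delete_some):
    """Simulate the recurring-task pattern: each 'task' deletes one
    batch, and a follow-up task is enqueued only if work remains."""
    queue = deque([delete_some])
    runs = 0
    while queue:
        task = queue.popleft()
        more_work = task()      # delete a batch; True if records remain
        runs += 1
        if more_work:
            queue.append(task)  # recurrence: spawn the next task
    return runs
```

The exit point is exactly the `if more_work` check: once a batch leaves nothing behind, no new task is spawned and the chain stops.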

Upvotes: 1
