Reputation: 1922
The previous guys made som problem in our Google App Engine app. Currently, the app is saving entities with NULL values, but it would be better if we could clean up all thees values.
Here is the ndb.Modal:
class Day(ndb.Model):
date = ndb.DateProperty(required=True, indexed=True)
items = ndb.StringProperty(repeated=True, indexed=False)
reason = ndb.StringProperty(name="cancelled", indexed=False)
is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Day
s where is_hole
is true
.
It's around 4 000 000 entities where around 2 000 000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10 000) takes around 5 seconds to run. Without the limit, it would case a DeadLineException
.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I faster delete all Day
where is_hole != False
?
(We are using Python)
Upvotes: 0
Views: 230
Reputation: 9494
Best way is to use MapReduce which will run in task queue and also you can do sharding to parallel the work. Here is the python code. Let me know, if you need any clarification
main.py
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from mapreduce.input_readers import InputReader
from google.appengine.api import app_identity
def deleteEntity(entity):
yield op.db.Delete(entity)
class DeleteEntitiesPipeline(base_handler.PipelineBase):
def run(self):
bucket_name = (app_identity.get_default_gcs_bucket_name())
yield mapreduce_pipeline.MapPipeline(
"job_name",
"main.deleteEntity",
"mapreduce.input_readers.DatastoreInputReader",
params={
"entity_kind": 'models.Day',
"filters": [("is_hole", "=", True)],
"bucket_name": bucket_name
},
shards=5)
class StartDelete(webapp2.RequestHandler):
def get(self):
pipeline = DeleteEntitiesPipeline()
pipeline.start()
application = webapp2.WSGIApplication([
('/deleteentities', StartDelete),
], debug=True)
Upvotes: 1
Reputation: 10128
No, there is not faster way to delete entities - deadline is fixed.
But there are some tricks.
if at_least_one_bad_record: delete_some_records (not longer than 30s) spawn again this task or redirect to this handler (next call will have next 30s)
Remember that it has exit point if no more good records. It will delete all matching record without clicking again.
Upvotes: 1