Sebastian Küpers

Reputation: 241

improve NDB query performance

I am looking for advice on how I can improve this in terms of speed:

My Data-model:

from google.appengine.ext import ndb

class Events(ndb.Model):
    eventid = ndb.StringProperty(required=True)
    participants = ndb.StringProperty(repeated=True)

The way I try to get the data:

def GetEventDataNotCached(eventslist):
    futures = []
    for eventid in eventslist:
        if eventid is not None:
            ke = database.Events.query(database.Events.eventid == eventid)
            future = ke.get_async(keys_only=True)
            futures.append(future)

    eventskeys = []
    for future in futures:
        eventkey = future.get_result()
        eventskeys.append(eventkey)

    data = ndb.get_multi(eventskeys)
    return data

So I fetch the keys asynchronously and then pass them to a get_multi. Is there any other way to make this faster? I am still not happy with the performance.

The repeated property can hold up to a couple of hundred strings. The Events model has several tens of thousands of rows, and eventslist contains just a couple dozen event IDs that I want to fetch.

Upvotes: 3

Views: 1179

Answers (2)

JasonC

Reputation: 349

I have found that the protocol-buffer deserialization overhead for long lists (i.e., large repeated=True properties) is very high.

Have you looked at this in appstats? Do you see a large gap of whitespace where no RPC is executing after your get_multi()? That is the deserialization overhead.

The only way I've found to overcome this is to remove the long lists and manage them in a separate model (i.e., avoid the long repeated property lists altogether), but of course, that may not be possible for your use case.

So the big question is: do you really need all the participants when you get the list of events, or can you defer that lookup in some way? E.g., it might be cheaper/faster to fetch all the events synchronously, then kick off async fetches for the participants of each event (from a different model) and combine them in memory. Perhaps you only need the 25 most recently registered participants, and can thus limit the cost of your sub-queries.
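To make the shape of that suggestion concrete, here is a minimal sketch with the datastore stubbed out as plain dicts so it runs anywhere; the structure names, the sample data, and the limit of 25 are all illustrative assumptions, not part of the question's code.

```python
# Stand-in for the Events model: event id -> event properties,
# with the long participant list moved OUT of the event entity.
events_by_id = {
    "ev1": {"name": "Launch party"},
    "ev2": {"name": "Hackathon"},
}

# Stand-in for a separate participants model, keyed by event and
# ordered newest-first (what a real query would sort by).
participants_by_event = {
    "ev1": ["p%d" % i for i in range(300)],  # long list stays out of Events
    "ev2": ["alice", "bob"],
}

def get_events_with_recent_participants(eventids, limit=25):
    """Fetch the events first, then only the `limit` newest participants."""
    result = []
    for eventid in eventids:
        event = events_by_id.get(eventid)
        if event is None:
            continue
        # In real ndb this would be an async query with fetch_async(limit);
        # here a slice plays that role.
        recent = participants_by_event.get(eventid, [])[:limit]
        result.append({"eventid": eventid,
                       "event": event,
                       "participants": recent})
    return result

combined = get_events_with_recent_participants(["ev1", "ev2"])
print(len(combined[0]["participants"]))  # 25, not 300
```

The point of the split is that the event fetch never pays the deserialization cost of the full participant list; each sub-query is capped at a size you choose.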

Upvotes: 5

tesdal

Reputation: 2459

An improvement in simplicity and execution speed (but not cost) could be:

data = database.Events.query(database.Events.eventid.IN(eventslist)).fetch(100)

The next step is to use eventid as the id of the key, creating entities like

event = Event(id=eventid, ...)

in which case you do

data = ndb.get_multi([ndb.Key(Event, eventid) for eventid in eventlist])

This is faster, and len(eventlist)*6 times cheaper.
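Why a direct key lookup beats a query can be illustrated without the App Engine SDK; the dict below plays the role of the datastore, and all names are illustrative only.

```python
# The datastore as a dict keyed by (kind, id): with eventid used as the
# key name, a batch get is one direct lookup per id -- no index scan and
# no separate query round-trip to resolve eventid to a key first.
datastore = {
    ("Events", "ev1"): {"name": "Launch party"},
    ("Events", "ev2"): {"name": "Hackathon"},
}

def get_multi(keys):
    """Stand-in for ndb.get_multi: one direct lookup per key."""
    return [datastore.get(k) for k in keys]

eventlist = ["ev1", "ev2", "missing"]
data = get_multi([("Events", eventid) for eventid in eventlist])
# Entities come back in request order; absent ids yield None,
# matching ndb.get_multi's behavior.
```

Compare this with the question's original approach, which needs one query per event id just to discover the key before the batch get can run.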

Upvotes: 2
