clds

Reputation: 171

Google Datastore: ndb.put_multi not returning

I am currently reinserting some entities from XML files into Google Datastore using the NDB library. The issue I am observing is that sometimes ndb.put_multi() does not seem to return and the script hangs waiting for it.

The code is basically doing the following:

from google.appengine.ext import ndb

@ndb.toplevel
def insertAll(entities):
    ndb.put_multi(entities)

entities = []
n_entities = 0
for event, case in tree:  # tree: presumably an etree.iterparse() iterator
    removeNamespace(case)
    if case.tag == "MARKGR" and event == "end":
        # get ndb.Model entities
        tm, app, rep = decodeTrademark(case)

        entities.append(tm)
        entities.extend(app)
        entities.extend(rep)
        if len(entities) > 200:
            n_entities += len(entities)
            insertAll(entities)
            entities = []

if len(entities) > 0:
    insertAll(entities)

I had noticed this behaviour before, but it seems to be pretty nondeterministic. Is there a way to debug this properly, and/or to set a timeout on ndb.put_multi() so I can at least retry it if it does not return after a given time?
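One possibility, sketched under the assumption that this is the legacy App Engine ndb client: put_multi() accepts RPC context options such as deadline, which makes the call fail instead of hang. The helper name, the 30-second deadline, and the retry count below are illustrative only.

from google.appengine.api import datastore_errors
from google.appengine.ext import ndb

@ndb.toplevel
def insertAllWithRetry(entities, retries=3):
    for attempt in range(retries):
        try:
            # deadline is an ndb RPC context option, in seconds;
            # the RPC raises instead of hanging once it is exceeded
            ndb.put_multi(entities, deadline=30)
            return
        except datastore_errors.Timeout:
            # assumption: deadline overruns surface as Timeout;
            # retry the whole batch, re-raise on the last attempt
            if attempt == retries - 1:
                raise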

Thanks in advance,

Upvotes: 1

Views: 416

Answers (3)

A.Queue

Reputation: 1572

Based on "App Engine datastore tip: monotonically increasing values are bad" by Ikai Lan.

Monotonically increasing values are values that are stored, read, and written strictly sequentially, like timestamps in logs. In the current Datastore implementation they end up in the same location, so Datastore cannot properly split the workload across tablets. When the operation rate is high enough and Datastore cannot scale horizontally, you will notice a slowdown. This is called hotspotting.

On top of that, Datastore creates an index for each indexable property (excluding, for example, Text properties), which means you can end up with several hotspots at once.

Workaround

One of the workarounds mentioned in the official documentation is to prepend a hash to indexed values:

If you do have a key or indexed property that will be monotonically increasing then you can prepend a random hash to ensure that the keys are sharded onto multiple tablets.

Read more in "High read/write rates to a narrow key range".
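A minimal sketch of that idea (the prefix length and key format are arbitrary choices, not from the docs). Using a deterministic hash of the value itself, rather than a random one, keeps the key reconstructible for lookups:

import hashlib

def sharded_key_name(value, prefix_len=4):
    # prepend a short, deterministic hash so consecutive values land
    # on different tablets instead of one hot one
    digest = hashlib.md5(str(value).encode()).hexdigest()[:prefix_len]
    return '%s-%s' % (digest, value)

# e.g. ndb.Key('LogEntry', sharded_key_name(timestamp))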

Upvotes: 1

A.Queue

Reputation: 1572

From the previous comments you have left, it looks like this application is hitting the Datastore write limit of roughly one write per second per entity group. You can read more about Datastore limits here.

As an alternative, you can try Cloud Firestore in Datastore mode, which removes some of these limits.

Upvotes: 0

GAEfan

Reputation: 11360

ORIGINAL ANSWER (before OP edit)

Your logic is flawed. insertAll() may never be getting called. Where are app and rep defined? And if they are defined outside this function, why are they in nested loops? Any entities in rep are getting written len(app) * len(tree) times!

Also, what about the case where len(entities) < 200? That check is inside 3 nested loops, and there will be iterations where it never fires. If the total after all the loops is 750, you would orphan 150 entities.

At least append this after the loops run, to write the orphaned entities (< 200):

if len(entities) > 0:
    insertAll(entities)

Also try reducing 200 to a smaller value, like 100. Depending on the sizes of the entities, 200 might be too many to finish before timing out.

Have you checked to see if ANY entities are written?

Also, are you sure you understand what an entity is, as used by the datastore? If you are simply pulling strings out of an XML file, those are not entities. rep and app must be lists of datastore entities, and tm must be an actual datastore entity.

UPDATE:

OK, that makes more sense, but you are still orphaning some entities, and have no control over the size of the put_multi(). Instead of if (len(entities) > 200):, you should batch them:

# primitive way to batch in groups of 100
batch_size = 100
num_full_batches = len(entities) // batch_size

for i in range(num_full_batches):
    ndb.put_multi(entities[i * batch_size : (i + 1) * batch_size])

# leftover partial batch, if any
if len(entities) % batch_size:
    ndb.put_multi(entities[num_full_batches * batch_size:])

If there are too many entities, you should send this off to a task queue.
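A one-line sketch of that with the App Engine deferred library (batch here stands for whichever slice of entities you hand off):

from google.appengine.ext import deferred

# run insertAll(batch) later on a push task queue instead of
# blocking the current request
deferred.defer(insertAll, batch)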

Upvotes: 1
