Reputation: 3383
We need to process thousands of time-series entities regularly, and we have performance issues reading that much data from the Datastore; the processing itself is computationally light and does not cause issues. We created a synthetic test simulating real server traffic where we test with 25k entities.
We use the Java runtime and Objectify (5.1.1 and 5.1.8) to access the Datastore.
The entity:

@Entity(name = "logs")
@Cache
public class Log {
    @Id
    public Long id;

    @Index
    public Ref<User> user;

    public String deviceId;
    public String nonce;
    public String version;
    public String data;

    @Index
    public Date timestamp;

    @OnSave
    private void prePersist() {
        if (timestamp == null) {
            timestamp = new Date();
        }
    }
}
The query:

Query<Log> query = ofy().load().type(Log.class)
        .filter("timestamp >", startDate)
        .order("timestamp")
        .limit(25000);
We tried different ways of loading the entities: first query.list(), then ofy().load().keys(query.keys()) so the look-up goes through GAE's memcache, but the results are the same. Retrieving 25k entities always takes around 8 seconds (measured via System.nanoTime()). In the case of query.list(), the call itself is fast but iterating over the entities is slow; it looks like each entity is retrieved from the Datastore at that moment rather than inside query.list(). All this runs in a simple servlet on an F4 frontend instance with dedicated memcache, no task queue.
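Sketched out, the two variants we timed look roughly like this (timing trimmed to the essentials; process() stands in for our real handling):

long start = System.nanoTime();

// Variant 1: plain query, then iterate. The list() call returns quickly,
// but the loop below is where the ~8 seconds are spent.
List<Log> logs = query.list();
for (Log log : logs) {
    process(log);
}

// Variant 2: keys-only query, then a batch get that goes through memcache.
Map<Key<Log>, Log> byKey = ofy().load().keys(query.keys());

long elapsedMs = (System.nanoTime() - start) / 1_000_000;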
Reading 25k entities is just a test to get some numbers about our server implementation's performance. In the real world we expect to read up to 500k entities at once; is that doable in 30-60 seconds with GAE's Datastore and dedicated memcache? In two years it could be millions of entities.
Another issue is limited RAM, but that is solvable via GAE's Managed VMs or GCE.
The question is: what is the fastest way to retrieve time-series entities from the Datastore + dedicated memcache with Objectify? It looks like the memcache does not help Objectify in our case; it holds tens of thousands of Objectify items, yet the loading time is the same as with an empty memcache. Objectify's/the Datastore's best practice is to use batch get operations, so how do we achieve that? Is Objectify already doing this under the hood with our entity and query, or do we have to change something? Can the low-level Datastore API help us improve read performance? Thank you.
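For reference, this is roughly what the same read would look like with the low-level API (an untested sketch, not something we have benchmarked yet):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

// Same "logs" query via the low-level API, with a large chunk size so
// entities arrive in big batches per RPC instead of the small default.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("logs")
        .setFilter(new Query.FilterPredicate(
                "timestamp", Query.FilterOperator.GREATER_THAN, startDate))
        .addSort("timestamp");
for (Entity e : ds.prepare(q).asIterable(
        FetchOptions.Builder.withLimit(25000).chunkSize(1000))) {
    String data = (String) e.getProperty("data");  // raw properties, no Objectify mapping
}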
EDIT: We are already working on merging the logs so that every log entity holds multiple current logs. That should give us around a 10x read improvement, which is still not enough for hundreds of thousands of records.
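A sketch of what such a merged entity could look like (LogBatch and its field names are just our working draft):

@Entity(name = "logBatches")
public class LogBatch {
    @Id
    public Long id;

    @Index
    public Ref<User> user;

    // Timestamp of the first log in the batch, so time-range queries still work
    @Index
    public Date firstTimestamp;

    // Serialized individual logs; one entity read now yields many log records
    public List<String> entries = new ArrayList<>();
}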
Upvotes: 3
Views: 722
Reputation: 13556
This solution is unlikely to scale the way you want.
Querying for @Cache entities defaults to a "hybrid" keys-only query (which is blazing fast) followed by a batch get (which is comparatively slow). If the cache is warm this can perform pretty well, but probably not at the scale you are talking about. And eventually, even with dedicated memcache, the cache will be reset; then your operations will probably time out and fail a few times until the cache is warmed up again.
You can disable this hybrid feature with ofy().load().hybrid(false), or simply by removing the @Cache annotation. A regular query will perform significantly better with a cold cache. You can also try changing the chunk() size to something larger; the default is something small, like 20.
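Roughly (a sketch; tune the chunk size to your payload):

List<Log> logs = ofy().load().type(Log.class)
        .filter("timestamp >", startDate)
        .order("timestamp")
        .limit(25000)
        .hybrid(false)   // plain query; entities come back with the query results
        .chunk(500)      // fetch larger batches per RPC than the default
        .list();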
Managed VM access to the datastore through the standard API is (currently) significantly slower than access from within Classic GAE. This may cause problems at this scale.
The datastore is generally poorly suited to operations that involve bulk reads and writes of huge numbers of entities. It also tends to be very expensive for that purpose. You might consider using the datastore as a reliable "master" copy and index the data in other slave databases that use clustered indexes. Or, depending on your durability requirements, just use the secondary datastore as a master copy.
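For illustration, a dual-write along these lines (saveLog, the JDBC target, and the table schema are all hypothetical, not something your code implies):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Datastore stays the durable master; a SQL table with a clustered index
// on the timestamp serves the bulk time-range reads.
void saveLog(Log log, Connection sql) throws SQLException {
    ofy().save().entity(log).now();  // master copy; runs @OnSave, so timestamp is set
    try (PreparedStatement ps = sql.prepareStatement(
            "INSERT INTO logs (id, user_key, ts, data) VALUES (?, ?, ?, ?)")) {
        ps.setLong(1, log.id);
        ps.setString(2, log.user.getKey().toWebSafeString());
        ps.setTimestamp(3, new Timestamp(log.timestamp.getTime()));
        ps.setString(4, log.data);
        ps.executeUpdate();
    }
}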
Upvotes: 1