How to get the total count of entities in a kind in Google Cloud Datastore

I have a kind with around 5 million entities in Google Cloud Datastore, and I want to get this count programmatically using Java. I tried the following code, but it only works up to a certain threshold (about 800K). When I run the query against 5 million records, it never returns a count; my guess is that it goes into an infinite loop. How can I get the entity count for data this large? I would rather not use the Google App Engine API, since it requires setting up an environment.

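import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Query;
import com.google.common.collect.Iterators;

private static Datastore datastore;

datastore = DatastoreOptions.getDefaultInstance().getService();

// Keys-only query over every entity of the kind
Query<Key> query = Query.newKeyQueryBuilder().setKind(kind).build();

// Iterates over every key on the client; count holds the number of entities
int count = Iterators.size(datastore.run(query));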

Upvotes: 2

Answers (3)

Prateek Jain

Reputation: 1552

COUNT aggregation in Datastore is now generally available.

Client libraries in multiple languages support this feature.

With it, you can avoid client-side aggregation, which carries the extra burden of increased egress cost. You also no longer need alternatives such as Cloud Functions to keep aggregate values updated on the backend, which have cost limitations of their own.
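
As a rough sketch of what this could look like in Java, assuming a recent `google-cloud-datastore` client that includes aggregation queries and default application credentials. The class name `KindCounter`, the method names, and the alias `total` are made up for illustration:

```java
import com.google.cloud.datastore.AggregationQuery;
import com.google.cloud.datastore.AggregationResult;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.EntityQuery;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.aggregation.Aggregation;
import com.google.common.collect.Iterables;

public class KindCounter {
    // Arbitrary alias chosen for this sketch; any name works.
    static final String ALIAS = "total";

    // Builds a COUNT aggregation over every entity of the given kind.
    static AggregationQuery buildCountQuery(String kind) {
        EntityQuery allEntities = Query.newEntityQueryBuilder().setKind(kind).build();
        return Query.newAggregationQueryBuilder()
            .over(allEntities)
            .addAggregation(Aggregation.count().as(ALIAS))
            .build();
    }

    // Runs the aggregation on the server and returns the single count value.
    static long countEntities(Datastore datastore, String kind) {
        AggregationResult result =
            Iterables.getOnlyElement(datastore.runAggregation(buildCountQuery(kind)));
        return result.get(ALIAS);
    }

    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        System.out.println(countEntities(datastore, args[0]));
    }
}
```

Only the count travels back to the client, rather than one key per entity as in the question's code.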

Upvotes: 1

Alex

Reputation: 5276

Check out Google Dataflow. A pipeline like the following should do it:

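import json

import apache_beam as beam
import requests
# Assumed import paths for the older v1 datastoreio connector used here;
# adjust them for your Beam version.
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.proto.datastore.v1 import query_pb2


def send_count_to_call_back(callback_url):
    # Returns a function that POSTs the final count to the callback URL.
    def f(record_count):
        requests.post(callback_url, data=json.dumps({
            'record_count': record_count,
        }))
    return f


def run_pipeline(project, callback_url):
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'staging_location': 'gs://%s.appspot.com/dataflow-data/staging' % project,
        'temp_location': 'gs://%s.appspot.com/dataflow-data/temp' % project,
        # .... other options
    })

    # Query that matches every entity of the kind.
    query = query_pb2.Query()
    query.kind.add().name = 'YOUR_KIND_NAME_GOES_HERE'

    p = beam.Pipeline(options=pipeline_options)
    _ = (p
     | 'fetch all rows for query' >> ReadFromDatastore(project, query)
     | 'count rows' >> beam.combiners.Count.Globally()
     | 'send count to callback' >> beam.Map(send_count_to_call_back(callback_url))
    )
    # Submit the job; without this the pipeline is only defined, never run.
    p.run()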

I use Python, but there is a Java SDK too: https://beam.apache.org/documentation/programming-guide/

The only issue is that your process has to trigger this pipeline, let it run on its own for a few minutes, and then wait for it to hit a callback URL to let you know it's done.

Upvotes: 0

Jim Morrison

Reputation: 2887

How accurate do you need the count to be? For a slightly out-of-date count, you can use a stats entity to fetch the number of entities for a kind.
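
A minimal sketch of reading that statistic from Java, assuming the `google-cloud-datastore` client and the built-in `__Stat_Kind__` statistics kind with its `kind_name` and `count` properties. The class and method names are hypothetical, and `-1` is just this sketch's way of signalling that no statistic exists yet:

```java
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;

public class KindStats {
    // Builds a query for the statistics entity that describes one kind.
    static Query<Entity> buildStatQuery(String kind) {
        return Query.newEntityQueryBuilder()
            .setKind("__Stat_Kind__")
            .setFilter(PropertyFilter.eq("kind_name", kind))
            .build();
    }

    // Returns the last computed entity count, or -1 if no statistic is available.
    static long approximateCount(Datastore datastore, String kind) {
        QueryResults<Entity> results = datastore.run(buildStatQuery(kind));
        return results.hasNext() ? results.next().getLong("count") : -1L;
    }
}
```

The value reflects whenever the statistics were last refreshed, so it can lag behind recent writes.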

If you can't use the stale counts from the stats entity, then you'll need to maintain counter entities for the real-time counts that you need. You should consider using a sharded counter.
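
To illustrate the idea only, here is an in-memory model of a sharded counter. It is not Datastore code: in a real setup each shard would be its own counter entity, updated in a transaction whenever an entity of the kind is created or deleted. The class name and shard count are arbitrary choices for this sketch:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

public class ShardedCounter {
    // Each slot stands in for one counter entity (shard).
    private final AtomicLongArray shards;

    public ShardedCounter(int numShards) {
        this.shards = new AtomicLongArray(numShards);
    }

    // Writes go to a randomly chosen shard, spreading contention across shards.
    public void increment() {
        int shard = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(shard);
    }

    // Reads sum every shard to get the total.
    public long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ShardedCounter counter = new ShardedCounter(20);
        for (int i = 0; i < 1000; i++) {
            counter.increment();
        }
        System.out.println(counter.total()); // prints 1000
    }
}
```

The trade-off is that reads become slightly more expensive (one fetch per shard) in exchange for writes that no longer contend on a single entity.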

Upvotes: 2
