Reputation: 95
I have a kind with around 5 million entities in Google Cloud Datastore. I want to get this count programmatically using Java. I tried the following code, but it only works up to a certain threshold (about 800K). When I run the query against 5M records, it seems to go into an infinite loop (my guess), since it never returns a count. How can I get the entity count for data this large? I would rather not use the Google App Engine API, since it requires setting up an environment.
private static Datastore datastore;
datastore = DatastoreOptions.getDefaultInstance().getService();
Query<Key> query = Query.newKeyQueryBuilder().setKind(kind).build();
int count = Iterators.size(datastore.run(query)); // count holds the number of entities
Upvotes: 2
Views: 3107
Reputation: 1552
COUNT aggregation in Datastore is now generally available.
Client libraries in multiple languages support this feature.
With it, you can avoid client-side aggregation, which adds egress cost. There is also no need for alternatives such as Cloud Functions that update aggregate values on the backend, which have cost limitations of their own.
Upvotes: 1
Reputation: 5276
Check out Google Dataflow. A pipeline like the following should do it:
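import json

import apache_beam as beam
import requests
from apache_beam.options.pipeline_options import PipelineOptions
# Assumed imports for the older Beam v1 Datastore connector used below.
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2


def send_count_to_call_back(callback_url):
    # Returns a function that POSTs the final count to the callback URL.
    def f(record_count):
        requests.post(callback_url, data=json.dumps({
            'record_count': record_count,
        }))
    return f


def run_pipeline(project, callback_url):
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'staging_location': 'gs://%s.appspot.com/dataflow-data/staging' % project,
        'temp_location': 'gs://%s.appspot.com/dataflow-data/temp' % project,
        # .... other options
    })
    query = query_pb2.Query()
    query.kind.add().name = 'YOUR_KIND_NAME_GOES_HERE'
    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'fetch all rows for query' >> ReadFromDatastore(project, query)
         | 'count rows' >> beam.combiners.Count.Globally()
         | 'send count to callback' >> beam.Map(send_count_to_call_back(callback_url))
         )
    p.run()  # submit the job; without this the pipeline is only defined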
I use Python, but there is a Java SDK too: https://beam.apache.org/documentation/programming-guide/
The only issue is that your process has to trigger this pipeline, let it run on its own for a few minutes, and then wait for it to hit a callback URL to let you know it's done.
Upvotes: 0
Reputation: 2887
How accurate do you need the count to be? For a slightly out-of-date count, you can use a stats entity to fetch the number of entities for a kind.
If you can't use the stale counts from the stats entity, then you'll need to keep counter entities for the real-time counts that you need. You should consider using a sharded counter.
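To make the sharded counter idea concrete, here is a minimal Java sketch of just the sharding logic. It is an in-memory stand-in, not Datastore code: the class name ShardedCounter and its methods are hypothetical, and each array slot plays the role of one counter-shard entity that you would store and update transactionally in Datastore. Writes are spread over several shards so no single entity becomes a hot spot, and the count is the sum of all shards.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical in-memory stand-in for a sharded counter.
// In a real setup each slot would be a separate counter entity in Datastore.
class ShardedCounter {
    private final AtomicLongArray shards;

    ShardedCounter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.shards = new AtomicLongArray(numShards);
    }

    // Call when an entity of the kind is created: bump one randomly chosen shard.
    void increment() {
        shards.incrementAndGet(randomShard());
    }

    // Call when an entity of the kind is deleted.
    void decrement() {
        shards.decrementAndGet(randomShard());
    }

    // The current count is the sum over all shards.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    private int randomShard() {
        return ThreadLocalRandom.current().nextInt(shards.length());
    }
}
```

The trade-off is that every create and delete of the kind also has to update a shard, while reading the count only touches a handful of small entities instead of scanning millions of keys.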
Upvotes: 2