Reputation: 98
I have to remove a large number of features (about 100 million records) from a GeoMesa data store as fast as possible. I tried to use:
String cql = "(" + DATE_TIME_FIELD + " BEFORE " + strCurrentDateTime + ") AND (" + TIMING_FIELD + " > 0)";
Filter filter = CQL.toFilter(cql);
featureStore.removeFeatures(filter);
However, it works too slowly. Both DATE_TIME_FIELD and TIMING_FIELD are indexed. Are there any other ways?
Thank you!
Upvotes: 1
Views: 457
Reputation: 1624
I would suggest parallelizing your deletes, the same way you would parallelize ingest code. For deletes, you would need to break up your CQL filter into discrete parts, e.g. (in pseudocode) "dtg between now/1 hour ago", "dtg between 1 hour ago/2 hours ago", etc.
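The time-slicing above can be sketched in plain Java with no GeoMesa dependencies. buildHourlyFilters is a hypothetical helper, and "dtg" stands in for your actual date attribute; each resulting ECQL string would then be passed to CQL.toFilter and removeFeatures on its own worker thread.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class DeleteFilters {
    // Hypothetical helper: split [start, end) into hour-long ECQL DURING filters.
    static List<String> buildHourlyFilters(ZonedDateTime start, ZonedDateTime end, String dateField) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");
        List<String> filters = new ArrayList<>();
        ZonedDateTime cursor = start;
        while (cursor.isBefore(end)) {
            ZonedDateTime next = cursor.plusHours(1);
            ZonedDateTime sliceEnd = next.isBefore(end) ? next : end;
            filters.add(dateField + " DURING " + fmt.format(cursor) + "/" + fmt.format(sliceEnd));
            cursor = next;
        }
        return filters;
    }

    public static void main(String[] args) {
        ZonedDateTime start = ZonedDateTime.of(2024, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        for (String f : buildHourlyFilters(start, start.plusHours(3), "dtg")) {
            System.out.println(f);
            // Each filter would be submitted to an ExecutorService, with each
            // worker calling: featureStore.removeFeatures(CQL.toFilter(f));
        }
    }
}
```

Because the slices are disjoint, the workers never contend over the same records, so this parallelizes cleanly.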
Deletes are slower than inserts for the following reasons:
- the records to be deleted must first be located with a query
- each record has an entry in every index, and each of those entries must be deleted
- the underlying database then has to do maintenance (compactions) to clean up the deleted entries
Parallelizing the deletes will help with the first two items, but not the database maintenance. So your database may still end up struggling under the load.
You should also ensure that the more discriminating index between DATE_TIME_FIELD and TIMING_FIELD is being used. You can do this by setting cardinality hints as described here:
http://www.geomesa.org/documentation/user/datastores/index_basics.html#cardinality-hints
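Per the linked documentation, a cardinality hint can be declared inline in the SimpleFeatureType specification string. A minimal sketch, assuming attribute names (dtg, timing, geom) that stand in for your actual schema:

```java
public class CardinalityHint {
    // Sketch only: "cardinality=high" tells the query planner that this
    // attribute index is highly selective and should be preferred.
    static String specWithHint() {
        return "dtg:Date:index=true:cardinality=high," // the more discriminating attribute
             + "timing:Long:index=true,"               // left at default cardinality
             + "*geom:Point:srid=4326";
    }

    public static void main(String[] args) {
        System.out.println(specWithHint());
    }
}
```

This spec string would be used when creating the feature type; marking the more selective attribute as high-cardinality steers the planner toward it.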
Upvotes: 0
Reputation: 1355
Generally, the distributed databases that GeoMesa leverages are optimized for inserts. Deleting large numbers of records will cause a number of minor and major compactions.
Compounding the problem, each index writes additional entries for each record which increases the number of things to delete.
In the case where one wants to delete an entire table or feature type, that usually works out fine.
If deleting millions of records comes up frequently, one could potentially write bulk-deletion helpers for the underlying datastore. (As an example, this kind of delete might be trivial using the GeoMesa FileSystem datastore with certain configurations.)
Upvotes: 1