Reputation: 8043
We have an external service that continuously sends us data. For the sake of simplicity, let's say this data consists of three strings in tab-delimited fashion:
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate records must not be processed. So, for instance, on day 1 the processing engine received
A B C
and on day 243 the same A B C
was received by the server. In this situation, the processing engine will emit a warning, "record already processed", and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet; membership in the set indicates whether a particular record has already been processed. Problems arise when this service must run with zero downtime: depending on the surge of data, the collection can exceed the bounds of memory. Also, in case of system outages, this data needs to be persisted someplace.
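A minimal sketch of this first approach (a Python set standing in for a HashSet; the `process` function name and the tab-delimited record format are assumptions for illustration):

```python
# In-memory deduplication: a set of already-seen records.
# Simple and fast, but grows without bound and is lost on restart.
seen = set()

def process(record: str) -> bool:
    """Process a tab-delimited record unless it was seen before.

    Returns True if the record was processed, False if it was a duplicate.
    """
    if record in seen:
        print("record already processed")
        return False
    seen.add(record)
    # ... do something meaningful with the record here ...
    return True
```

This is the behaviour described above: the second arrival of the same record is rejected.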
Store the incoming data in a database, and process the next set of data only if it is not already present there. This helps with the durability of the history in case of some catastrophe, but there's the overhead of maintaining proper indexes and of aggressive sharding if performance becomes an issue.
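The database variant can lean on the database itself to reject duplicates. A sketch using SQLite with a UNIQUE constraint (the table and column names are illustrative, not from the question):

```python
import sqlite3

# A UNIQUE constraint makes the database enforce deduplication for us.
conn = sqlite3.connect(":memory:")  # use a file path for real durability
conn.execute("CREATE TABLE processed (record TEXT UNIQUE)")

def process(record: str) -> bool:
    """Insert the record; if the UNIQUE constraint fires, it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO processed (record) VALUES (?)", (record,))
    except sqlite3.IntegrityError:
        print("record already processed")
        return False
    # ... do something meaningful with the record here ...
    return True
```

Letting the constraint do the check avoids a separate SELECT-then-INSERT race when several workers insert concurrently.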
....or some other technique
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks
Upvotes: 0
Views: 73
Reputation: 46872
You need some kind of backing store for persistence, whatever the solution, so that much work has to be done regardless. But it doesn't have to be an SQL database for something so simple - there are alternatives to memcached that can persist to disk.
In addition to that, you could consider Bloom filters to reduce the in-memory footprint. These can give false positives, so you would then need to fall back to a second (slower but reliable) layer, which could be the disk store.
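The Bloom-filter idea can be sketched in a few lines. This is a toy implementation (sizes and hash scheme chosen arbitrarily for illustration): constant memory, possible false positives, never false negatives, so a "might contain" answer still has to be confirmed against the reliable layer.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a fixed-size bit array probed by k hash positions."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False => definitely new; True => probably seen, verify in the
        # slower persistent layer before rejecting the record.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

In the layered design described above, `might_contain` returning False lets you skip the disk lookup entirely for genuinely new records.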
And finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so searching for that term turns up more papers and ideas (not sure if you're aware that "idempotent" is a useful search term).
Upvotes: 1
Reputation: 6793
You could create a hash of the data and store that in a backing store; the hash would be smaller than the actual data (provided your data isn't smaller than a hash).
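A sketch of this approach (the `record_key` helper and the in-memory set standing in for the backing store are illustrative assumptions): each record is reduced to a fixed-size SHA-256 digest, so storage per record stays constant no matter how large the records get.

```python
import hashlib

def record_key(record: str) -> str:
    """Fixed-size key for a record: 64 hex characters (32 bytes),
    regardless of record size. SHA-256 collisions are negligible in practice."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

seen_hashes = set()  # stand-in for any persistent key-value store

def process(record: str) -> bool:
    """Process a record unless its hash has been seen before."""
    key = record_key(record)
    if key in seen_hashes:
        print("record already processed")
        return False
    seen_hashes.add(key)
    # ... do something meaningful with the record here ...
    return True
```

Note this only pays off when records are larger than the digest, exactly as the caveat above says.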
Upvotes: 1