sc_ray

Reputation: 8043

Duplicates from a stream

We have an external service that continuously sends us data. For the sake of simplicity, let's say this data consists of three strings in tab-delimited form.

datapointA datapointB datapointC

This data is received by one of our servers and then forwarded to a processing engine, where something meaningful is done with the dataset.

One of the requirements is that the processing engine must not process duplicate records. So, for instance, on day 1 the processing engine received A B C, and on day 243 the same A B C was received by the server. In this situation, the processing engine will spit out a warning, "record already processed", and skip that particular record.
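The duplicate check described above can be sketched in a few lines. This is a minimal illustration, not the asker's actual system: the `process` function and the in-memory `seen` set are hypothetical names, and a real deployment would need the set backed by persistent storage (as the answers below discuss).

```python
seen = set()  # in production this would need to be a persistent store

def process(record: str) -> bool:
    """Process a tab-delimited record; warn and skip on duplicates."""
    key = record.strip()
    if key in seen:
        print("record already processed")
        return False
    seen.add(key)
    # ... forward the record to the real processing engine here ...
    return True

process("datapointA\tdatapointB\tdatapointC")  # first sighting: processed
process("datapointA\tdatapointB\tdatapointC")  # duplicate: warns, skipped
```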

There may be a few ways to solve this issue:

… or some other technique

Can somebody point out some case studies, established patterns, or practices for solving this particular issue?

Thanks

Upvotes: 0

Views: 73

Answers (2)

andrew cooke

Reputation: 46872

you need some kind of backing store for persistence, whatever the solution, so that much work has to be implemented no matter what. but it doesn't have to be an sql database for something so simple - an alternative to memcached that can persist to disk would do

in addition to that, you could consider bloom filters for reducing the in-memory footprint. these can give false positives, so you would then need to fall back to a second (slower but reliable) layer (which could be the disk store).

and finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so a search on that term turns up more papers/ideas (not sure if you're aware that "idempotent" is a useful search term).

Upvotes: 1

krystan honour

Reputation: 6793

You could create a hash of the data and store that in a backing store; the hash would be smaller than the actual data (provided your data isn't smaller than a hash).
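A quick sketch of that suggestion, with hypothetical names and an in-memory set standing in for the real backing store: each record is reduced to a fixed 32-byte SHA-256 digest, so storage per record stays constant regardless of how large the records are.

```python
import hashlib

seen_hashes = set()  # in practice, a persistent backing store

def record_key(record: str) -> bytes:
    # A fixed 32-byte SHA-256 digest replaces the full record.
    return hashlib.sha256(record.encode("utf-8")).digest()

def already_processed(record: str) -> bool:
    """Return True if this record's hash was seen before."""
    key = record_key(record)
    if key in seen_hashes:
        return True
    seen_hashes.add(key)
    return False
```

One caveat worth noting: with a cryptographic hash like SHA-256, accidental collisions are negligible in practice, but strictly speaking two different records could map to the same digest.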

Upvotes: 1
