Reputation: 311645
I would like to operate a service that anticipates having subscribers who are interested in various kinds of products. A product is a bag of dozens of attributes:
{
"product_name": "...",
"product_category": "...",
"manufacturer_id": "...",
[...]
}
A subscriber can express an interest in any subset of these attributes. For example, this subscription:
{ [...]
"subscription": {
"manufacturer_id": 1234,
"product_category": 427
}
}
will receive events that match both product_category: 427 and manufacturer_id: 1234. Conversely, this event:
{ [...]
"event": {
"manufacturer_id": 1234,
"product_category": 427
}
}
will deliver messages to any subscribers who care about that manufacturer_id, or that product_category, or that manufacturer_id and that product_category.
It is vital that these notifications be delivered as quickly as possible, because subscribers may have only a few hundred milliseconds, or a second at most, to take downstream actions. The cache lookup should therefore be fast.
Question: If one wants to cache subscriptions this way for highly efficient lookup on one or more filterable attributes, what sort of approaches or architectures would allow one to do this well?
Upvotes: 3
Views: 391
Reputation: 449
I would propose sharding as an architecture pattern.
Every shard will listen for all events for all products from the source of the events.
For the best latency I would propose two layers of sharding. The first layer is geographical (by country or city, depending on customer distribution). It is connected to the event source over a low-latency connection and sits in the same data center as the second-level shards for that location. The second layer shards on userId; each of its shards receives all product events but handles subscriptions only for its region.
The first layer is responsible for fanning out the events to the second layer based on the geographical position of the subscriptions. This is more or less a single microservice. It could be done with a general-purpose event broker, but since the logic is relatively simple we can implement it in golang or C++ and optimize for latency.
In the second layer, every shard is responsible for a number of users from the location, and every shard receives all the events for all products. A shard consists of one microservice for subscription caching and notification logic, plus one or more notification delivery microservices.
The subscriptions microservice keeps an in-memory cache of the subscriptions and checks every event for subscribed users using maps; for example, it stores a map from product field to subscribed userIds. Latency matters most for this microservice, so a custom implementation in golang or C++ should deliver the best latency. The subscriptions microservice should not have its own db or any external cache, as network latency is just a drag in this case.
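As a minimal sketch of such an in-memory map (written in Python for brevity rather than golang or C++), the index below maps each (field, value) pair to the subscriptions that require it and counts hits per subscription, so a subscription on several fields fires only when all of them match. The counting step is one possible way to get the "match both" behaviour from the question, not something this answer prescribes, and the class name, method names, and sample ids are all hypothetical.

```python
from collections import defaultdict


class SubscriptionIndex:
    """In-memory index from (field, value) to the subscriptions that require it."""

    def __init__(self):
        # (field, value) -> set of subscription ids that include this condition
        self._by_condition = defaultdict(set)
        # subscription id -> number of conditions that must all match
        self._required = {}
        # subscription id -> user id to notify
        self._owner = {}

    def add(self, sub_id, user_id, conditions):
        """Register a subscription; `conditions` is a dict of field -> value."""
        self._required[sub_id] = len(conditions)
        self._owner[sub_id] = user_id
        for field, value in conditions.items():
            self._by_condition[(field, value)].add(sub_id)

    def match(self, event):
        """Return the user ids whose subscriptions are fully satisfied by the event."""
        hits = defaultdict(int)
        for field, value in event.items():
            for sub_id in self._by_condition.get((field, value), ()):
                hits[sub_id] += 1
        return {
            self._owner[sub_id]
            for sub_id, count in hits.items()
            if count == self._required[sub_id]
        }


index = SubscriptionIndex()
index.add("s1", "alice", {"manufacturer_id": 1234})
index.add("s2", "bob", {"manufacturer_id": 1234, "product_category": 427})
index.add("s3", "carol", {"manufacturer_id": 1234, "product_category": 999})

matched = index.match({"manufacturer_id": 1234, "product_category": 427})
print(sorted(matched))  # ['alice', 'bob']
```

The work per event is proportional to the number of event fields plus the number of subscriptions that share at least one condition with the event, and nothing leaves the process during the lookup.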
The notification delivery microservices depend on where you want to send the notifications, but again golang or C++ can deliver some of the lowest latencies.
The system's data is its subscriptions, and this data can be sharded per location and userId in the same way as the rest of the architecture. So we can have a single DB per second-level shard.
Where to store the product fields depends on how often they change: they can live in the code (if they change very rarely or never) or in the DBs, with a synchronisation mechanism between the DBs if they are expected to change more often.
Upvotes: 0
Reputation: 7320
I would do this with a key/value table that holds an array of subscriber ids for each "attribute name = value" key, like this (where a, b, c, d, y, z are subscriber ids):
{ [...]
"manufacturer_id=1234": [a,b,c,d],
"product_category=427": [a,b,y,z],
[...]
}
In your example the event has "manufacturer_id" = 1234 and "product_category" = 427, so just search for the subscribers where key = manufacturer_id=1234 or product_category=427 and you'll get the arrays of all the subscribers you want. Then just "merge distinct" those arrays and you'll have every subscriber id you need.
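The lookup and "merge distinct" step described above can be sketched as follows. This assumes the table is available as a plain in-process dictionary; the function name is hypothetical and the ids are the placeholder letters from the example.

```python
# Flat key/value table: "attribute=value" -> subscriber ids (placeholder data).
table = {
    "manufacturer_id=1234": ["a", "b", "c", "d"],
    "product_category=427": ["a", "b", "y", "z"],
}


def subscribers_for(event, table):
    """Union the subscriber lists for every attribute=value pair in the event."""
    result = set()
    for attribute, value in event.items():
        result.update(table.get(f"{attribute}={value}", []))
    return result


event = {"manufacturer_id": 1234, "product_category": 427}
recipients = subscribers_for(event, table)
print(sorted(recipients))  # ['a', 'b', 'c', 'd', 'y', 'z']
```

Note that a plain union treats every key independently. A subscriber who filters on two attributes at once would still need a check that both of their conditions matched before being notified.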
Or, depending on how complex/smart the database you are using is, you can normalize it, like this:
{ [...]
"manufacturer_id": {
"1234": [a,b,c,d],
"5678": [e,f,g,h],
[...]
},
"product_category": {
"427": [a,b,g,h],
"555": [c],
[...]
},
[...]
}
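A rough sketch of how the normalized layout above might be built and queried, assuming it is held as nested in-memory dictionaries rather than in a specific database. The function names and the sample subscriptions are made up for illustration.

```python
from collections import defaultdict


def build_index(subscriptions):
    """Build attribute -> value -> set of subscriber ids from (id, filters) pairs."""
    index = defaultdict(lambda: defaultdict(set))
    for subscriber_id, filters in subscriptions:
        for attribute, value in filters.items():
            # Values are stored as strings, mirroring the JSON keys above.
            index[attribute][str(value)].add(subscriber_id)
    return index


def lookup(index, event):
    """Collect subscribers listed under any attribute/value pair of the event."""
    found = set()
    for attribute, value in event.items():
        found |= index.get(attribute, {}).get(str(value), set())
    return found


subscriptions = [
    ("a", {"manufacturer_id": 1234, "product_category": 427}),
    ("c", {"manufacturer_id": 1234, "product_category": 555}),
    ("e", {"manufacturer_id": 5678}),
]
index = build_index(subscriptions)
print(sorted(lookup(index, {"manufacturer_id": 1234, "product_category": 427})))  # ['a', 'c']
```

Compared with the flat "attribute=value" keys, the nested form makes it cheap to list or drop every value of one attribute, at the cost of one extra dictionary hop per lookup.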
Upvotes: 0
Reputation: 9545
The answer depends on some factors that you have not described in your scenario. For example, what is the extent of the data? How many products/categories/users are there, and what are the estimated data sizes for these: megabytes, gigabytes, terabytes? Also, what is the expected throughput of changes to products/subscriptions and of events?
So my answer will be for a medium-size scenario in the gigabytes range, where you can likely fit your subscription dataset into memory on a single machine.
In this case the straightforward approach would be to have your events appear on an event bus, for example implemented with Kafka or Pulsar. Then you would have a service that consumes the events as they come in and queries an in-memory data store for the subscription matches. (The in-memory db has to be built or copied on startup and kept up to date, potentially from a different event source.)
This in-memory store could be a database like MongoDB, for example (strictly speaking a document database rather than a key-value store). It offers a pure in-memory mode (the in-memory storage engine, part of MongoDB Enterprise) that gives you more predictable performance. To ensure predictable, high-performance lookups within the db you need to specify your indexes correctly: any property that is relevant to the lookup needs to be indexed. Also consider that compound indexes can speed up lookups on combinations of properties. Other in-memory stores that you may want to consider as alternatives are Redis or Memcached. If performance is a critical requirement, I would recommend running trials with different systems in which you ingest your dataset, build the indexes, and run the queries you need, so you can compare lookup times.
So the service can now quickly determine the set of users to notify. From here you have two choices: you could have the same service send out notifications directly, or (what I would probably do) you could separate concerns and have a second service whose responsibility is performing the actual notifications. The communication between those services could again be via a topic on the event bus system.
This kind of setup should easily handle thousands of events per second with single service instances. If the number of events grows to massive sizes, you can run multiple instances of your services to improve throughput. For that you'd have to look into organizing consumer groups correctly for multiple consumers.
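To illustrate the idea behind consumer groups, here is a small sketch of how events might be spread over partitions and how partitions might be divided among service instances so that each event is handled by exactly one instance. It is not the partitioner or assignment strategy of Kafka or Pulsar; both functions and the instance names are hypothetical.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map an event key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign_partitions(num_partitions, instances):
    """Spread partitions round-robin over the service instances of one group."""
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        assignment[instances[partition % len(instances)]].append(partition)
    return assignment


assignment = assign_partitions(6, ["matcher-0", "matcher-1", "matcher-2"])
print(assignment)
# {'matcher-0': [0, 3], 'matcher-1': [1, 4], 'matcher-2': [2, 5]}

partition = partition_for("manufacturer_id=1234", 6)
owner = next(name for name, parts in assignment.items() if partition in parts)
print(partition, owner)
```

Because any instance may receive any event under this scheme, each instance would need its own full copy of the subscription index.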
The technologies for implementing the services are probably not critical, but if I knew there were strict performance requirements I would go with a language without a garbage collector, for example Rust or C++. Alternatives could be languages like golang or Java, but you'd have to pay attention to how garbage collection is performed and make sure it doesn't interfere with your performance requirements.
In terms of infrastructure: for a medium or large system you would typically run your services in containers on a cluster of machines, for example using Kubernetes.
If your system's scale is on the smaller side, you may not need a distributed setup and can instead deploy the described components/services on a single machine.
With such a setup, the expected latency for a local client, from the time the event comes in until a notification goes out, should reliably be in the single-digit milliseconds.
Upvotes: 1