Reputation: 1519
I was recently asked this system design question in an interview:
Let's suppose an application allows users to share articles from 3rd party sites with their connections. Assume all share actions go through a common code path on the app site (served by multiple servers in geographically diverse colos). Design a system to aggregate, in near-real time, the N most shared articles over the last five minutes, last hour and last day. Assume the number of unique shared articles per day is between 1M and 10M.
So I came up with below components:
Now I started talking about how data from existing service tier that handles share events will get to the aggregation servers? Possible solution was to use any messaging queue like Kafka here.
And interviewer asked me why you chose Kafka here and how Kafka will work like what topics you will create and how many partitions will it have. Since I was confuse so couldn't answer properly. Basically he was trying to get some idea on point-to-point vs publish-subscribe or push vs pull model?
Now I started talking about how Aggregation service operates. One solution I gave was to keep a collection of counters for each shared URL by 5 minute bucket for the last 24 hours (244 buckets per URL) As each share events happens, increment the current bucket and recompute the 5 min, hour, and day totals. Update Top-N lists as necessary. As each newly shared URL comes in, push out any URLs that haven't been updated in 24 hours. Now I think all this can be done on single machine.
Interviewer asked me can this all be done on one machine? Also can maintenance of 1M-10M tracked shares be done on one machine? If not, how would you partition? What happens if it crashes and how will you recover? Basically I was confuse how Aggregation service will actually work here? How it is getting data from Kafka and what is going to do actually with those data.
Now for data store part, I don't think we need persistent data store here so I suggested we can use Redis with partitioning and redundancy.
Interviewer asked me how will you partition and have redundancy here? And how Redis instance will get updated from the entire flow and how Redis will be structured? I was confuse on this as well. I told him that we can write output from Aggregation service to these redis instance.
There were few things I was not able to answer since I am confuse on how the entire flow will work. Can someone help me understand how we can design a system like this in a distributed fashion? And what I should have answered for the questions that interviewer asked me.
Upvotes: 2
Views: 1328
Reputation: 15824
The intention of these questions is not to get ultimate answer for the problem. Instead check the competence and thought process of the interviewee. There is no point to be panic while answering these kind questions while facing tough follow up questions. Intention of the follow up questions is to guide you or give some hint for the interviewee.
I will try to share one probable answer for this problem. Assume I have s distributed persistent system like Cassandra. And I am going to maintain the status of sharing at any moment using my Cassandra infrastructure. I will maintain a Redis cluster ahead of persistence layer for LRU caching and maintain the buckets for 5 minutes, 1 hour and a day. Eviction will be configured using expire set. Now my aggregator service only need to address minimal data present within my Redis LRU cache. Set up a high through put distributed Kafka cluster will pump data from shared handler. And Kafka feed the data to Redis cluster and from there to Cassandra. To maintain the near real time output, we have to maintain the Kafka cluster throughput matching with it.
Upvotes: 2