mbz_slk
mbz_slk

Reputation: 231

Design an algorithm to return number of unique users between given time interval

A Web Server can receive millions of user login request. A user can login multiple times. Design an optimal algorithm/data-structure in terms of time complexity to return total number of unique users between given time intervals.

For e.g : Count total number of unique users between interval t1 & t2 and t2 & t3. Also think about returning total count for overlapping intervals (t1 = 10am, t2 = 10:15am, t3 = 10:30am, return total number of users between 10:10am to 10:20am)

Below is what I am proposing, Would appreciate people's comments?

IMO combination of Hashmap & min heap would be good as an optimal solution.

Hashmap- will have key as user-id and value as node pointer to corresponding node in min heap. Min heap - last-logged-in time as key and value as the user-id. Root will be the user whose log-in time is the oldest. Also at the root store the count of total number of nodes in the min-heap so we can quickly return the count.

  1. When a user logs-in. Lookup in the hashmap with user-id as key. a) If no match then insert the new user into hashmap & a new node for user into the min heap and increment the count stored with the root node of min heap. b) else it's an old user update it's last-logged-in value and do not increment the count in root node of min heap.

  2. Whenever we want to find out the unique users logged between t2-t1 then a. Extract min(root) from the heap and check if current time - last-logged-in time > t2-t1 mins. If it is greater than delete the value from the hashmap & min_heap. b. Repeat the above step (a) until the min element of the heap satisfies current time - last-logged-in time <= t2-t1 mins c. Return the value of the count from root node of the min heap.

But I am not able to nail down the algorithm for overlapping intervals.

Upvotes: 4

Views: 1667

Answers (1)

templatetypedef
templatetypedef

Reputation: 373512

I think there is a much easier way to do this. Consider storing all the data in a balanced binary search tree, where the keys are the login times and the values are the list of all people who logged in at that time (assuming that there can be multiple logins at exactly the same moment in time). From there, you can find all people who logged in during an interval between time T1 and T2 by finding the smallest node in the BST whose time is greater than or equal to T1, then continuously computing the inorder successor of that node until you arrive at a node that is at time strictly after time T2.

Doing a lookup in the BST to find the first node at time greater than or equal to T1 will take time O(log n) in a balanced BST, and computing the inorder successor many times will take time O(k), where k is the total number of matches you report. This takes a total of O(log n + k) time. Since you have to spend at least O(k) time reporting all matches logins in any algorithm, this has a very low overhead.

Alternatively, if you are getting the data in a stream from the server (i.e. new logins are always happening as time evolves), you can just use a standard array to hold all the requests. You can just append new requests to the end of the array. Since time always moves forward, this means that the array is always sorted, so you can use binary search to find the start point of the range. Assuming the data isn't pathologically constructed, you could also use interpolation search to make the lookup times expected O(log log n) rather than O(log n), giving expected O(log log n + k) lookup times when finding k total elements.

As for handling overlapping ranges - there are standard algorithms for taking a collection of ranges and merging them together into a minimal number of nonoverlapping ranges. You can always apply one of those techniques prior to doing lookups to handle this case.

Hope this helps!

Upvotes: 1

Related Questions