Reputation: 587
You work for Zynga, and want to count the number of currently active players for different games. Your web server handles pings from many different games and each user has a unique GUID. Must be able to query number of active users for one game at a time. Active users are those that gotten ping in last minute.
The log lines come in continuously into a web server:
10.1.12.13 - - "http://zynga.com/ping?guid=<guid>&game=<gameID>" -
What is the quickest/easiest way to count active users? Please suggest a 45 minutes answer with some code.
My version
// web server interface, every time ping comes in count() will be called
// void count(String gameId, String guid)
// int getNumberActivePlayers(String gameId)
struct Record{
String gameID;
String guid;
};
class PingStorage{
private:
max_heap<long, Record> storage;
public:
// O(log(n))
// n = total number of elements in storage
void count(String gameId, String guid){
long currentTimeStamp = getUnixTimeStamp();
Record rec ;
rec.gameId = gameId;
rec.guid = guid;
storage.add(currentTimeStamp, rec);
}
//N = numner of records in last ,minutes in storage
//O(N)
int getNumberActivePlayers(String gameId){
map<String, Set<string> > game2user;
long tillTimeStamp = getUnixTimeStampNow() - 60;
while(true){
pair<long, Record> rec = storage.getMax(); //O(1)
if(rec.first <= tillTimeStamp) break;
Set<String> temp = game2user[rec.gameid]; //O(1)
temp.add(rec.userid); //O(log(N)) - O(1)
}
return game2user[gameID].size();
}
};
Upvotes: 16
Views: 1861
Reputation: 32542
My approach would be to use a deque (called queue in the remainder of this post) that all GUIDs are pushed to that are observed, i.e., which is thus ordered by age. In addition I would use a hashmap that contains the pointers to the entries of any GUID present in the queue.
When a new GUID is pushed to the queue, the old entry (if any) would be looked up in the hashmap, removed from the queue, and the new one assigned instead to the hashmap.
As time proceeds, all entries in the queue that surpass the age threshold, would to be popped out (and removed from the hashmap).
The length of the queue (a.k.a. number of active users) could be kept track of as a separate variable, to avoid hopping through the queue for every query.
To support multiple games just add such structure for every gameID.
Complexity: O(1) insert/deletion of observations (given a perfect hash, i.e. no collisions), O(1) query, O(n) space.
Upvotes: 3
Reputation: 137512
This would be my sequence of answers:
zcount
command. This is useful and reliable.Upvotes: 0
Reputation: 34695
Assuming this is a realtime solution, you can handle ping requests in O(1), generate the current player statistic in O(1), and using O(num_player) space by sacrificing some accuracy. The key is to discretize timings.
Overview
The basic idea is to represent discrete time intervals as objects and store in those objects the following property: the number of distinct players that pinged during this time interval that have not since pinged. To query the number of active users, calculate a weighted sum of the last x time intervals that comprise the last minute.
Details
First, select an acceptable temporal resolution. In this example, I choose 15-second intervals.
Maintain five PingInterval data structures to represent five of these intervals (spanning 1 more intervals than 1 minute). PingInterval contains one property: a counter. These PingIntervals are maintained in a PingMonitor. Each time a player pings, update a map within PingMonitor that maps each player to the current time interval. When you perform this mapping, take the following steps, which maintain the counts within the PingIntervals (according to the characteristics I described in the overview section).
(If the PingInterval to represent the current time doesn't exist yet, set the oldest PingInterval to null, create a new PingInterval in a thread-safe way, and then continue as normal.)
When you want to query the number of active users, calculate a time-passed-weighted sum of the last five intervals time intervals. For example, if you are only 5 seconds into the current time interval (meaning the next 10 seconds of that interval have not happened yet), calculate this value: 2/3 * oldest interval + sum of the 4 newest intervals.
Other thoughts
Five intervals is very conservative; we could scale up the number considerably for greater accuracy (maybe one per second) and it would still afford us a significant savings. The important part is that our times are now discrete intervals. This means that when we go to count the number of active users, we don't have to look at every individual time (which is equal to the number of users); instead we can look at x bins of time that we've predefined.
Upvotes: 7
Reputation: 39079
EDIT: I assumed this question wasn't about getting a real-time answer for the question "how many users are active now", but rather getting historical values - how many users were active at 3:25PM. I'm keeping the old solution below the new one:
So, you want to know how many users are active now, Keep a queue per game. Whenever you see a new log entry, find out which game it belongs to, and add it to the game's queue. After each addition clean up old entries at the beginning of the queue (all entries older than 1 minute at the time of the clean-up).
When asked for the number of active users in a game, do the same clean-up on the game's queue, and return the queue's depth.
Keep a hash mapping the games to the queues, and you got an O(N) operation with N being the number of lines in the log - each line is processed at most twice - once for adding it, once for removing it. You also do an additional comparison per addition and look-up (when deciding a queue entry is not old enough), but that's constant time times N. So O(N) in all.
Previous answer to that other question: Seeing there aren't all that many minutes (1440 a day), I would create a vector for each game, with a slot for each minute.
Go over the log file, for each line get the time, round it up to the closest minute and add 1 to the appropriate slot in the array. Once you're done, you'll know exactly how many active users were for each game at every minute.
Complexity - O(N), with N being the number of lines in the log file.
To support multiple games, just use a hash to map from the game's name to its vector.
Now, this assumes you only check for active users at whole minute boundries (1:00:00, 1:01:00, etc...). This is probably what you're required to do anyway.
Upvotes: 0