Master Yoda
Master Yoda

Reputation: 587

What is the quickest/easiest way to count active users in last one minute?

You work for Zynga, and want to count the number of currently active players for different games. Your web server handles pings from many different games and each user has a unique GUID. Must be able to query number of active users for one game at a time. Active users are those that gotten ping in last minute.

The log lines come in continuously into a web server:

10.1.12.13 - - "http://zynga.com/ping?guid=<guid>&game=<gameID>" -

What is the quickest/easiest way to count active users? Please suggest a 45 minutes answer with some code.


My version

// web server interface, every time ping comes in count() will be called
// void count(String gameId, String guid)
// int getNumberActivePlayers(String gameId)

struct Record{
  String gameID;
  String guid;
};

class PingStorage{
private:
  max_heap<long, Record> storage;
public:
  //    O(log(n))
  //  n = total number of elements in storage
  void count(String gameId, String guid){
    long currentTimeStamp = getUnixTimeStamp();
    Record rec ;
    rec.gameId = gameId;
    rec.guid = guid;
    storage.add(currentTimeStamp, rec);
  }
  //N = numner of records in last ,minutes in storage
  //O(N)
  int getNumberActivePlayers(String gameId){
    map<String, Set<string> > game2user;
    long tillTimeStamp = getUnixTimeStampNow() - 60;
    while(true){
      pair<long, Record> rec = storage.getMax(); //O(1)
      if(rec.first <= tillTimeStamp) break;  
      Set<String> temp = game2user[rec.gameid]; //O(1)
      temp.add(rec.userid); //O(log(N)) - O(1)
    }
    return game2user[gameID].size();
  }
};

Upvotes: 16

Views: 1861

Answers (4)

moooeeeep
moooeeeep

Reputation: 32542

My approach would be to use a deque (called queue in the remainder of this post) that all GUIDs are pushed to that are observed, i.e., which is thus ordered by age. In addition I would use a hashmap that contains the pointers to the entries of any GUID present in the queue.

When a new GUID is pushed to the queue, the old entry (if any) would be looked up in the hashmap, removed from the queue, and the new one assigned instead to the hashmap.

As time proceeds, all entries in the queue that surpass the age threshold, would to be popped out (and removed from the hashmap).

The length of the queue (a.k.a. number of active users) could be kept track of as a separate variable, to avoid hopping through the queue for every query.

To support multiple games just add such structure for every gameID.

Complexity: O(1) insert/deletion of observations (given a perfect hash, i.e. no collisions), O(1) query, O(n) space.

Upvotes: 3

Colonel Panic
Colonel Panic

Reputation: 137512

This would be my sequence of answers:

  1. Why bother? It's easiest to count minute-by-minute how many users are active. Isn't it enough to know that?
  2. If you really care about up-to-date information, let's count in second-by-second bins (as described by Cheeken). This will be accurate up to a fraction of a second.
  3. Ok, if real-time accuracy is "necessary", and you want to interview me about data structures, let's use a heap of customers scored by last activity time (as described by Master Yoda).
  4. If real-time accuracy is necessary, and we were to do this in production, let's use the data structure server Redis. We maintain a sorted set of customers scored by last activity time. We can query how many customers were active in the last minute or in the last hour with the zcount command. This is useful and reliable.

Upvotes: 0

cheeken
cheeken

Reputation: 34695

Assuming this is a realtime solution, you can handle ping requests in O(1), generate the current player statistic in O(1), and using O(num_player) space by sacrificing some accuracy. The key is to discretize timings.

Overview

The basic idea is to represent discrete time intervals as objects and store in those objects the following property: the number of distinct players that pinged during this time interval that have not since pinged. To query the number of active users, calculate a weighted sum of the last x time intervals that comprise the last minute.

Details

First, select an acceptable temporal resolution. In this example, I choose 15-second intervals.

Maintain five PingInterval data structures to represent five of these intervals (spanning 1 more intervals than 1 minute). PingInterval contains one property: a counter. These PingIntervals are maintained in a PingMonitor. Each time a player pings, update a map within PingMonitor that maps each player to the current time interval. When you perform this mapping, take the following steps, which maintain the counts within the PingIntervals (according to the characteristics I described in the overview section).

  • If the player is already mapped to an interval and it is the current interval, do nothing.
  • Else, if the player is mapped to an interval that isn't the current interval,
    • decrement the old interval's count,
    • increment the current interval's count,
    • and map that player to that interval.
  • Else, if the player is not mapped to an interval at all,
    • increment the current interval's count,
    • map the player to the current interval.

(If the PingInterval to represent the current time doesn't exist yet, set the oldest PingInterval to null, create a new PingInterval in a thread-safe way, and then continue as normal.)

When you want to query the number of active users, calculate a time-passed-weighted sum of the last five intervals time intervals. For example, if you are only 5 seconds into the current time interval (meaning the next 10 seconds of that interval have not happened yet), calculate this value: 2/3 * oldest interval + sum of the 4 newest intervals.

Other thoughts

Five intervals is very conservative; we could scale up the number considerably for greater accuracy (maybe one per second) and it would still afford us a significant savings. The important part is that our times are now discrete intervals. This means that when we go to count the number of active users, we don't have to look at every individual time (which is equal to the number of users); instead we can look at x bins of time that we've predefined.

Upvotes: 7

zmbq
zmbq

Reputation: 39079

EDIT: I assumed this question wasn't about getting a real-time answer for the question "how many users are active now", but rather getting historical values - how many users were active at 3:25PM. I'm keeping the old solution below the new one:

So, you want to know how many users are active now, Keep a queue per game. Whenever you see a new log entry, find out which game it belongs to, and add it to the game's queue. After each addition clean up old entries at the beginning of the queue (all entries older than 1 minute at the time of the clean-up).

When asked for the number of active users in a game, do the same clean-up on the game's queue, and return the queue's depth.

Keep a hash mapping the games to the queues, and you got an O(N) operation with N being the number of lines in the log - each line is processed at most twice - once for adding it, once for removing it. You also do an additional comparison per addition and look-up (when deciding a queue entry is not old enough), but that's constant time times N. So O(N) in all.

Previous answer to that other question: Seeing there aren't all that many minutes (1440 a day), I would create a vector for each game, with a slot for each minute.

Go over the log file, for each line get the time, round it up to the closest minute and add 1 to the appropriate slot in the array. Once you're done, you'll know exactly how many active users were for each game at every minute.

Complexity - O(N), with N being the number of lines in the log file.

To support multiple games, just use a hash to map from the game's name to its vector.

Now, this assumes you only check for active users at whole minute boundries (1:00:00, 1:01:00, etc...). This is probably what you're required to do anyway.

Upvotes: 0

Related Questions