Reputation: 6326
Say we have about 1e10
lines of log file every day; each line contains an ID number (an integer of at most 15 digits), a login time, and a logout time. Some IDs may log in and log out several times.
Question 1:
How can we count the total number of distinct IDs that have logged in? (Each ID should be counted only once.)
I tried to use a hash table here, but I found that the memory it would need may be too large.
Question 2:
Find the time when the number of online users is largest.
I think we could split the day into 86400 seconds, then for each line of the log file add 1 to every second in the online interval. Or maybe I could sort the log file by login time?
Upvotes: 2
Views: 255
Reputation: 5919
You can do this in a *nix shell (assuming tab-separated fields, which is what cut -f expects by default):
cut -f1 logname.log | sort | uniq | wc -l
cut -f2 logname.log | sort | uniq -c | sort -r
Upvotes: 6
Reputation: 14750
Use a segment tree to store intervals of consecutive IDs. Scan the logs for all the login events. To insert an ID, first search for a segment containing it: if one exists, the ID is a duplicate. If it doesn't exist, search for the segments immediately before or after the ID. If they exist, remove them, merge in the new ID as needed, and insert the resulting segment. If they don't exist, insert the ID as a segment of one element.
Once all IDs have been inserted, count them by summing the cardinalities of all the segments in the tree.
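A minimal sketch of this interval-merging idea in Python (function names and the use of a plain dict are my own; a balanced tree would make the lookups O(log n), but a dict scan shows the logic):

```python
def insert_id(intervals, x):
    """Insert integer x into a dict {start: end} of disjoint, inclusive
    intervals of consecutive ids. Returns False if x is a duplicate."""
    for s, e in intervals.items():
        if s <= x <= e:
            return False  # x already covered by a segment
    left = right = None
    for s, e in intervals.items():
        if e == x - 1:
            left = s    # segment ending just before x
        if s == x + 1:
            right = s   # segment starting just after x
    if left is not None and right is not None:
        # x bridges two segments: merge them into one
        intervals[left] = intervals.pop(right)
    elif left is not None:
        intervals[left] = x          # extend left segment rightwards
    elif right is not None:
        intervals[x] = intervals.pop(right)  # extend right segment leftwards
    else:
        intervals[x] = x             # new one-element segment
    return True

def count_unique(ids):
    intervals = {}
    for x in ids:
        insert_id(intervals, x)
    # sum of cardinalities of all segments
    return sum(e - s + 1 for s, e in intervals.items())
```

This pays off when IDs are dense: a million consecutive IDs collapse into a single segment of two integers.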
For question 2, assuming the events are sorted by time: scan the log and keep a counter c of the number of currently logged-in users, as well as the maximum m found so far and the associated time t. For each login, increment c; for each logout, decrement it. At each step, update m and t if c exceeds m.
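A minimal sketch of this sweep in Python (the function name and the (login, logout) tuple format are my own assumptions):

```python
def peak_online(sessions):
    """sessions: iterable of (login, logout) time pairs.
    Returns (max_concurrent_users, time_at_which_it_occurred)."""
    events = []
    for login, logout in sessions:
        events.append((login, 1))    # login:  +1 user
        events.append((logout, -1))  # logout: -1 user
    # Sort by time; at equal times, process logouts (-1) before logins (+1)
    # so a back-to-back logout/login doesn't inflate the count.
    events.sort(key=lambda e: (e[0], e[1]))
    c = m = 0
    t = None
    for time, delta in events:
        c += delta
        if c > m:
            m, t = c, time
    return m, t
```

Sorting the 2e10 events dominates the cost, but an external sort (or GNU sort, as suggested elsewhere on this page) handles that; the sweep itself is a single pass with O(1) memory.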
Upvotes: 0
Reputation: 108129
For question 1, forget C++ and use *nix tools. Assuming the log file is space delimited, then the number of unique logins in a given log is computed by:
$ awk '{print $1}' foo.log | sort | uniq | wc -l
GNU sort will happily sort files larger than memory. Here's what each piece is doing:
- awk '{print $1}' extracts the first whitespace-separated field (the ID) from each line;
- sort orders the IDs so that duplicates become adjacent;
- uniq collapses adjacent duplicate lines;
- wc -l counts the remaining lines, i.e. the number of distinct IDs.
Upvotes: 1
Reputation: 12152
For question 2 to make sense, you probably have to log two kinds of events: user logs in and user logs out. Two different activities, each with the user ID. Sort this list by the time at which the activity (either login or logout) occurred, then scan it with a counter called currentusers: add 1 for each login and subtract 1 for each logout. The maximum that counter (current users) reaches is the value you're interested in; you will probably also want to track the time at which it occurred.
Upvotes: 1
Reputation: 419
For 1, you can try working with fragments of the data, each small enough to fit into memory, partitioned by ID range so that every occurrence of an ID falls into the same fragment. I.e. instead of
countUnique([1, 2, ... 1000000])
try
countUnique([1, 2, ... 1000]) +
countUnique([1001, 1002, ... 2000]) +
countUnique([2001, 2002, ...]) + ... + countUnique([999000, 999001, ... 1000000])
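A sketch of this partitioning in Python (names and the modulo-based bucketing are my own; any partition where all occurrences of an ID land in the same bucket works, since per-bucket unique counts can then simply be summed):

```python
def count_unique_partitioned(ids, num_buckets=1000):
    """Count distinct ids using at most one bucket's worth of memory
    at a time. Bucketing by value (id % num_buckets) guarantees that
    duplicates of an id always land in the same bucket."""
    total = 0
    for b in range(num_buckets):
        seen = set()
        for x in ids:  # in practice: one pass over the log file per bucket,
                       # or pre-split the file into num_buckets smaller files
            if x % num_buckets == b:
                seen.add(x)
        total += len(seen)
    return total
```

With 1e10 lines, pre-splitting the log into bucket files in one pass and then deduplicating each file separately avoids rescanning the whole log per bucket.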
2 is a bit more tricky. Partitioning the work into manageable intervals (a second, as you suggested) is a good idea. For each second, find the number of people logged in during that second by using the following check (pseudocode):
def loggedIn(loginTime, logoutTime, currentTimeInterval):
    return loginTime <= currentTimeInterval <= logoutTime
Apply loggedIn to all 86400 seconds, then take the maximum over the 86400 user counts to find the time when the population of online users is largest.
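Testing every session against every second is O(86400 × N). A cheaper variant of the same per-second idea, along the lines the question itself suggests, is a difference array: +1 at the login second, -1 just after the logout second, then a prefix sum gives the online count for every second in one pass (function name and inclusive-logout convention are my own assumptions):

```python
def busiest_second(sessions, day_len=86400):
    """sessions: iterable of (login_sec, logout_sec) pairs within one day,
    logout second inclusive. Returns (second, max_online_count)."""
    diff = [0] * (day_len + 1)
    for login, logout in sessions:
        diff[login] += 1       # user comes online at login second
        diff[logout + 1] -= 1  # and goes offline after the logout second
    best_t = best_c = 0
    c = 0
    for t in range(day_len):
        c += diff[t]           # prefix sum = users online during second t
        if c > best_c:
            best_c, best_t = c, t
    return best_t, best_c
```

This is O(N + 86400) time and 86401 integers of memory, independent of how many sessions overlap.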
Upvotes: 0