prehistoricpenguin

Reputation: 6326

Huge amount of data analysis

Say we have about 1e10 lines of log file every day; each line contains an ID number (an integer of at most 15 digits), a login time, and a logout time. Some IDs may log in and log out several times.

Question 1:

How do we count the total number of distinct IDs that have logged in? (We should not count any ID twice or more.)

I tried to use a hash table here, but the memory it would need could be too large.


Question 2:

Calculate the time when the population of online users is largest.

I think we could split the day into 86400 seconds, then for each line of the log file add 1 to every second in the online interval. Or maybe I could sort the log file by login time?

Upvotes: 2

Views: 255

Answers (5)

Pradeep Pati

Reputation: 5919

You can do that in a *nix shell.

  1. cut -f1 logname.log | sort | uniq | wc -l
  2. cut -f2 logname.log | sort | uniq -c | sort -rn | head -1

(cut assumes tab-delimited fields; pass -d' ' if the log is space-delimited. The -n flag makes sort order the counts numerically rather than lexicographically.)

Upvotes: 6

didierc

Reputation: 14750

  1. Use a segment tree to store intervals of consecutive IDs (see the sketch after this list). Scan the logs for all the login events. To insert an ID, first search for a segment containing it: if one exists, the ID is a duplicate. If not, search for the segments immediately before and after the ID. If they exist, remove them, merge in the new ID as needed, and insert the resulting segment. If they don't exist, insert the ID as a segment of one element.

    Once all IDs have been inserted, count them by summing the cardinalities of all the segments in the tree.

  2. assuming that:

    • a given ID may be logged in only once at any given time,
    • events are stored in chronological order (that's normally true of logs)

    Scan the log and keep a counter c of the number of currently logged-in users, as well as the maximum m found so far and the associated time t. For each login, increment c; for each logout, decrement it. At each step, update m and t if c exceeds m.
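
Here's a minimal Python sketch of the interval-merging idea from point 1. A plain sorted list stands in for the tree (a balanced tree would make each insert O(log n)); the names insert_id and count_ids are illustrative, not from the original answer:

import bisect

starts = []  # sorted start points of the disjoint ID intervals
ends = {}    # maps an interval's start to its inclusive end

def insert_id(x):
    i = bisect.bisect_right(starts, x) - 1
    if i >= 0 and x <= ends[starts[i]]:
        return  # duplicate: x already lies inside an interval
    lo, hi = x, x
    if i >= 0 and ends[starts[i]] == x - 1:
        lo = starts[i]            # extend the interval ending at x - 1
    j = bisect.bisect_left(starts, x + 1)
    if j < len(starts) and starts[j] == x + 1:
        hi = ends.pop(starts[j])  # absorb the interval starting at x + 1
        del starts[j]
    if lo != x:
        ends[lo] = hi             # grew the left neighbour in place
    else:
        bisect.insort(starts, lo)
        ends[lo] = hi

def count_ids():
    return sum(ends[s] - s + 1 for s in starts)

This pays off when IDs form long consecutive runs; if they are scattered, it degenerates into one interval per ID and saves nothing over a plain set.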

Upvotes: 0

Wayne Conrad

Reputation: 108129

For question 1, forget C++ and use *nix tools. Assuming the log file is space-delimited, the number of unique logins in a given log is computed by:

$ awk '{print $1}' foo.log | sort | uniq | wc -l

GNU sort will happily sort files larger than memory. Here's what each piece is doing:

  • awk is extracting the first space-delimited column (the ID number).
  • sort is sorting those ID numbers, because uniq needs sorted input.
  • uniq is returning only unique numbers.
  • wc prints the number of lines, which will be the number of unique numbers.

Upvotes: 1

carlosdc

Reputation: 12152

For question 2 to make sense, you probably have to log two things: user logs in and user logs out, i.e. two different activities along with the user ID. If this list is sorted by the time at which the activity (either login or logout) occurred, you just scan it with a counter called currentusers: add 1 for each login and subtract 1 for each logout. The maximum that counter reaches is the value you're interested in; you will probably also want to track the time at which it occurred.
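
A minimal Python sketch of that scan, assuming the log has already been reduced to (time, delta) event pairs sorted by time (the function name is illustrative):

def peak_concurrent(events):
    # events: (time, delta) pairs, delta = +1 for a login, -1 for a logout
    current = best = 0
    best_time = None
    for time, delta in events:
        current += delta
        if current > best:
            best, best_time = current, time
    return best, best_time

# peak_concurrent([(1, 1), (2, 1), (3, -1), (4, 1), (6, -1), (7, -1)])
# -> (2, 2): at most two users online, first reached at time 2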

Upvotes: 1

Daryl

Reputation: 419

For 1, you can try working with one fragment of the ID range at a time, each small enough to fit into memory (the sum is only correct because the fragments partition the IDs by value, so no ID can appear in two fragments). I.e., instead of

countUnique([1, 2, ... 1000000])

try

countUnique([1, 2, ... 1000]) + 
countUnique([1001, 1002, ... 2000]) + 
countUnique([2001, 2002, ...]) + ... + countUnique([999000, 999001, ... 1000000])
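
A minimal Python sketch of that idea, assuming a space-delimited log with the ID in the first column (the file name, bucket count, and function name are illustrative):

def count_unique_ids(log_path, num_buckets=1024):
    total = 0
    for b in range(num_buckets):
        seen = set()
        with open(log_path) as f:           # re-reads the log once per bucket;
            for line in f:                  # in practice you'd scatter lines into
                uid = int(line.split()[0])  # per-bucket temp files in a single pass
                if uid % num_buckets == b:
                    seen.add(uid)
        total += len(seen)
    return total

Because uid % num_buckets partitions the IDs by value, no ID can land in two buckets, so the per-bucket unique counts can safely be summed.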

2 is a bit more tricky. Partitioning the work into manageable intervals (a second, as you suggested) is a good idea. For each second, find the number of people logged in during that second by using the following check (in Python):

def loggedIn(loginTime, logoutTime, currentTimeInterval):
    # the user is online during this second if their session interval covers it
    return loginTime <= currentTimeInterval < logoutTime

Apply loggedIn to all 86400 seconds, then take the maximum of the 86400 per-second user counts to find the time when the population of online users is largest.
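
Rather than testing every session against every one of the 86400 seconds, a difference array produces the same per-second counts in a single pass over the sessions (this trick is a substitution, not part of the original answer; names are illustrative):

def busiest_second(sessions):
    # sessions: (login_sec, logout_sec) pairs with 0 <= sec <= 86400
    delta = [0] * 86401
    for login, logout in sessions:
        delta[login] += 1    # one more user from this second on...
        delta[logout] -= 1   # ...until the logout second
    best_t = best = current = 0
    for t in range(86400):
        current += delta[t]  # prefix sum = users online during second t
        if current > best:
            best, best_t = current, t
    return best_t, best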

Upvotes: 0
