user1002288

Reputation: 5040

Finding the most common three-item sequence in a very large file

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.

Sample log file:

User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4

Corresponding results:

A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence

My idea is to use two hash tables. The first hashes on user ID and stores that user's page sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
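Here is a minimal sketch of that idea in C++, assuming everything fits in memory (the file name visits.log and the dash-separated sequence keys are just placeholders for illustration):

#include <array>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // first table: user ID -> last three pages seen ([2] is the newest)
    std::unordered_map<std::string, std::array<int, 3>> history;
    // second table: "p1-p2-p3" -> number of times that sequence appeared
    std::unordered_map<std::string, long> counts;

    std::ifstream log("visits.log");           // placeholder log file name
    std::string user;
    int page;
    while (log >> user >> page) {
        auto& h = history[user];
        h = {h[1], h[2], page};                // slide the three-page window
        if (h[0] != 0) {                       // only count complete triples (page IDs start at 1)
            ++counts[std::to_string(h[0]) + "-" +
                     std::to_string(h[1]) + "-" +
                     std::to_string(h[2])];
        }
    }

    // scan the second table for the most frequent sequence
    std::string best;
    long best_count = 0;
    for (const auto& kv : counts)
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    std::cout << best << " seen " << best_count << " times\n";
}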

However, since I have to use two hash tables, memory cannot hold everything at once, so I have to use the disk, and it is not efficient to access the disk very often.

How can I do this better?

Upvotes: 13

Views: 3751

Answers (5)

Mooing Duck

Reputation: 66961

There are probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited-length log file.

#include <array>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <limits>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence; //[0] = newest page, [2] = oldest
typedef int sequence_count;

struct sequence_hash { //unordered_map needs a hash for the array key
    std::size_t operator()(const sequence& s) const {
        return (std::hash<pageid>()(s[0]) * 31 + std::hash<pageid>()(s[1])) * 31 + std::hash<pageid>()(s[2]);
    }
};

int main() {
    const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
    const int num_passes = 4;
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
        std::ifstream logfile("log.log");
        pageid minpage = num_pages/num_passes*pass; //this pass only counts triples whose first (oldest) page is in this range
        pageid maxpage = num_pages/num_passes*(pass+1)+1;
        if (pass == num_passes-1) //if it's the last pass, fix rounding errors
            maxpage = std::numeric_limits<pageid>::max();
        userhistory.clear(); //clear per-pass state: avoids stale history from the previous pass
        visits.clear();      //and keeps memory down to one pass's share of sequences
        while (logfile >> curuser >> curpage) { //read in a line
            sequence& curhistory = userhistory[curuser]; //find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage; //push back the new page for that user
            //if they visited three pages in a row
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory]; //get times sequence was hit
                ++count; //and increase it
                if (count > max_count) { //if that's a new max
                    max_count = count;  //update the max
                    max_sequence = curhistory; //std::array, so this copies the whole triple
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}

Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.

[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.

Upvotes: 3

Evgeny Kluev

Reputation: 24677

If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
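As a rough illustration of that bounded-table idea (the string keys like "2-3-4" and the capacity parameter are arbitrary choices, not something prescribed here):

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Bounded counter: counts keys approximately, evicting the least recently
// touched key once the table grows past `capacity`.
class lru_counter {
    std::size_t capacity;
    std::list<std::string> recency;   // front = most recently touched
    struct entry { long count; std::list<std::string>::iterator pos; };
    std::unordered_map<std::string, entry> table;
public:
    explicit lru_counter(std::size_t cap) : capacity(cap) {}

    void add(const std::string& key) {
        auto it = table.find(key);
        if (it == table.end()) {
            recency.push_front(key);
            table[key] = {1, recency.begin()};
            if (table.size() > capacity) {        // evict the least recently used key
                table.erase(recency.back());
                recency.pop_back();
            }
        } else {
            ++it->second.count;
            recency.erase(it->second.pos);        // move the key to the front
            recency.push_front(key);
            it->second.pos = recency.begin();
        }
    }

    std::pair<std::string, long> most_common() const {
        std::pair<std::string, long> best{"", 0};
        for (const auto& kv : table)
            if (kv.second.count > best.second) best = {kv.first, kv.second.count};
        return best;
    }
};

Feeding every completed triple to add() and calling most_common() at the end gives an approximate answer, biased towards sequences that stay popular long enough not to be evicted.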

If you want an exact result, use an external sort to sort the logs by user ID, then combine every 3 consecutive entries into triples and sort again, this time by page IDs.

Update (sort by timestamp)

Some preprocessing may be needed to properly use logfiles' timestamps:

  • If the log files are already sorted by timestamp, no preprocessing is needed.
  • If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and use a merge sort to read them (see the sketch after this list).
  • If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data in the correct order.
  • If the files are not sorted by timestamp (which is not likely in practice), use an external sort by timestamp.
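For the merge-sort case, a k-way merge driven by a priority queue is enough. A rough sketch (the assumption that every record starts with a numeric timestamp is mine, purely for illustration):

#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <memory>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

int main(int argc, char** argv) {
    // each argv[i] is a log file already sorted by a leading numeric timestamp
    std::vector<std::unique_ptr<std::ifstream>> files;
    // (timestamp, rest of line, file index), smallest timestamp on top
    using rec = std::tuple<long long, std::string, std::size_t>;
    std::priority_queue<rec, std::vector<rec>, std::greater<rec>> heap;

    auto push_next = [&](std::size_t i) {         // read one record from file i, if any
        long long ts;
        std::string rest;
        if (*files[i] >> ts && std::getline(*files[i], rest))
            heap.emplace(ts, rest, i);
    };

    for (int i = 1; i < argc; ++i) {
        files.push_back(std::make_unique<std::ifstream>(argv[i]));
        push_next(files.size() - 1);
    }
    while (!heap.empty()) {                       // emit records in global timestamp order
        auto [ts, rest, i] = heap.top();
        heap.pop();
        std::cout << ts << rest << '\n';          // in practice, feed this to the sequence counter
        push_next(i);                             // refill from the file just consumed
    }
}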

Update 2 (improving the approximate method)

The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may have different patterns at different times of day, or may be different on weekends. The original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.

Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element; the next one only elements seen at least 2 times; the next one only elements seen at least 4 times, and so on. When an element is evicted from one sub-queue, it is added to the next one down, so it passes through every sub-queue lower in the hierarchy before it is completely removed. Such a priority queue still has O(1) complexity per update, but allows a much better approximation of the most popular sequences.
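Here is one possible shape for that structure, as I read it (the power-of-two thresholds, the halving capacities and the string keys are concrete choices made only for illustration):

#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Hierarchical LRU counter: level i holds keys seen at least 2^i times and has
// capacity N / 2^(i+1). A key evicted from level i > 0 is demoted to level i-1;
// a key evicted from level 0 (the largest queue) is forgotten entirely.
class hierarchical_lru {
    struct entry { long count; int level; std::list<std::string>::iterator pos; };
    std::vector<std::list<std::string>> levels;   // per level, front = most recently touched
    std::vector<std::size_t> capacity;
    std::unordered_map<std::string, entry> table;

    void place(const std::string& key, int level) {
        levels[level].push_front(key);
        entry& e = table[key];
        e.level = level;
        e.pos = levels[level].begin();
    }

    void evict_from(int level) {                  // cascade demotions down the hierarchy
        for (int l = level; l >= 0; --l) {
            if (levels[l].size() <= capacity[l]) break;
            std::string victim = levels[l].back();
            levels[l].pop_back();
            if (l == 0) table.erase(victim);      // fell out of the largest queue: gone
            else place(victim, l - 1);            // demote one level down
        }
    }

public:
    hierarchical_lru(std::size_t n, int num_levels) {
        levels.resize(num_levels);
        for (int i = 0; i < num_levels; ++i)
            capacity.push_back(std::max<std::size_t>(1, n >> (i + 1)));   // N/2, N/4, ...
    }

    void add(const std::string& key) {
        auto it = table.find(key);
        if (it == table.end()) {                  // new key enters the largest queue
            table[key].count = 1;
            place(key, 0);
            evict_from(0);
            return;
        }
        entry& e = it->second;
        ++e.count;
        int target = e.level;                     // promote while the count crosses 2^(level+1)
        while (target + 1 < (int)levels.size() && e.count >= (1L << (target + 1)))
            ++target;
        levels[e.level].erase(e.pos);
        place(key, target);
        evict_from(target);
    }
};

The point is that a sequence disappears completely only after it has been pushed out of every level below its own, so a brief burst of unrelated traffic cannot displace sequences that have been popular for a long time.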

Upvotes: 4

Matthew Strawbridge

Reputation: 20640

If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:

  1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
  2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (a C++ sketch of this step follows the list). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
  3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
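Here is roughly what step 2 could look like in C++ (the file names sorted and triplets match the ones above; the two-column user/page layout is taken from the question):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("sorted");      // output of step 1: "user page" per line, grouped by user
    std::ofstream out("triplets");   // input for step 3: one "p1-p2-p3" per line
    std::string user, prevuser, page, p1, p2;
    while (in >> user >> page) {
        if (user != prevuser) {      // new user: restart the sliding window
            p1.clear();
            p2.clear();
            prevuser = user;
        } else if (!p1.empty()) {    // the window already holds two pages: emit a triplet
            out << p1 << '-' << p2 << '-' << page << '\n';
        }
        p1 = p2;                     // slide the window by one page
        p2 = page;
    }
}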

This approach might not have optimal performance, but it shouldn't run out of memory.

Upvotes: 1

Rob H

Reputation: 1716

I think you only have to store the most recently seen triple for each userid, right? So you have two hash tables. The first, keyed by userid with the most recently seen triple as its value, has size equal to the number of userids.

EDIT: this assumes the file is already sorted by timestamp.

The second hash table has a key of userid:page-triple, and a value of count of times seen.

I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):

#  $1 is userid, $2 is pageid

{
    old = ids[$1];          # map with id, most-recently-seen triple
    split(old,oldarr,"-"); 
    oldarr[1]=oldarr[2]; 
    oldarr[2]=oldarr[3]; 
    oldarr[3] = $2; 
    ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
    tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
    if (oldarr[1] != "") { # don't accumulate incomplete triples
        triples[tripleid]++; }   # count this triple-id
}
END {
    MAXCOUNT = 0;
    for (tid in triples) {
        print tid" "triples[tid];
        if (triples[tid] > MAXCOUNT) {
            MAXCOUNT = triples[tid];  # track the highest count
            MAXID = tid;              # and which triple-id it belongs to
        }
    }
    print "MAX is->" MAXID" seen "MAXCOUNT" times";
}

Upvotes: 1

ams

Reputation: 25599

If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.

On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.

Clearly the sequence counters dominate the memory problem.

If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to keep the most-used entries cached in memory.
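To make the mmap idea concrete, here is a rough Linux-only sketch (the backing file name, the page count and the dense (p1, p2, p3) indexing are illustrative assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t num_pages = 1000;
    const std::size_t slots = num_pages * num_pages * num_pages;   // 10^9 possible sequences
    const std::size_t bytes = slots * sizeof(std::uint32_t);       // ~4GB of counters

    int fd = open("counters.bin", O_RDWR | O_CREAT, 0600);         // disk-backed counter array
    if (fd < 0 || ftruncate(fd, bytes) != 0) { perror("open/ftruncate"); return 1; }

    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    std::uint32_t* counts = static_cast<std::uint32_t*>(mem);

    // counting a sequence (p1, p2, p3), each in [1, num_pages], is then a single increment;
    // the OS decides which parts of the array stay resident in RAM
    auto bump = [&](std::size_t p1, std::size_t p2, std::size_t p3) {
        ++counts[((p1 - 1) * num_pages + (p2 - 1)) * num_pages + (p3 - 1)];
    };
    bump(2, 3, 4);   // e.g. one visit to the sequence 2-3-4

    munmap(mem, bytes);
    close(fd);
}

Finding the winner afterwards is a single linear scan over the array; if the hot set of sequences is much smaller than 4GB, the frequently touched pages stay cached and the rest stays on disk.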

So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.

I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if you rarely get there. Here are some ideas:

  • Evict the least commonly used sequences to a file, keeping the most common in memory. You'd need two passes over the table: one to determine which counts are below average, and one to do the eviction. Somehow you'd need to know where you put each evicted entry whenever you get a hash miss, which might prove tricky.

  • Just dump the whole table to a file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.

  • Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.

Upvotes: 2
