rpbear

Reputation: 640

handling many huge log files with python

I'm using some Python scripts to do statistics. One sort of log, which I'll call the A logs, has records in this format:

[2012-09-12 12:23:33] SOME_UNIQ_ID filesize

The other sort, which I'll call the B logs, has this format:

[2012-09-12 12:24:00] SOME_UNIQ_ID

I need to count how many records in the A logs also appear in the B logs, and get the time gap between the two records that share the same ID. My implementation was to load the time and ID of every B-log record into a map, then iterate the A logs and check whether each ID exists in the map. The problem is that this uses far too much memory, because I have almost 100 million records in the B logs. Any suggestions to improve the performance and memory usage? Thanks.
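A minimal sketch of what my current approach looks like (the file names are placeholders, not the real paths):

from datetime import datetime

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

# Build an ID -> timestamp map from the B logs; this is what uses all the memory.
b_times = {}
with open('b.log') as b_log:
    for line in b_log:
        stamp, uniq_id = line.strip().rsplit(' ', 1)
        b_times[uniq_id] = datetime.strptime(stamp, TIME_FORMAT)

# Scan the A logs and compute the gap for every ID that also appears in B.
with open('a.log') as a_log:
    for line in a_log:
        stamp, uniq_id, filesize = line.strip().rsplit(' ', 2)
        if uniq_id in b_times:
            gap = abs(datetime.strptime(stamp, TIME_FORMAT) - b_times[uniq_id])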

Upvotes: 6

Views: 1779

Answers (6)

pepr

Reputation: 20770

I suggest using a database that supports both a datetime type and a uniqueidentifier type for the shown form of the unique ID. The uniqueidentifier type comes from Windows, so if you use Windows for the task you can use, say, Microsoft SQL Server 2008 R2 Express Edition (for free). The two tables will not use any kind of key.

You can use the bcp utility of MS SQL, which will probably be one of the fastest ways to insert the data from the text files (or the BULK INSERT command).

The indexes on the uniqueidentifier columns should be created only after all the records have been inserted; otherwise the existence of the indexes makes the insert operation slower. The inner join should then be as fast as technically possible.
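A rough sketch of that flow driven from Python via pyodbc (the connection string, table names, and file paths are assumptions; the bcp command-line utility could replace the BULK INSERT statements, and the bracketed timestamps would need cleaning up before they load into datetime columns):

import pyodbc  # assumes an MS SQL Server ODBC driver is available

# Connection string is an assumption; adjust server, database and credentials.
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;'
                      'DATABASE=logs;Trusted_Connection=yes')
cur = conn.cursor()

cur.execute("CREATE TABLE log_a (ts datetime, uniq_id uniqueidentifier, filesize int)")
cur.execute("CREATE TABLE log_b (ts datetime, uniq_id uniqueidentifier)")

# Bulk-load first, with no indexes in place (paths are placeholders).
cur.execute("BULK INSERT log_a FROM 'C:\\logs\\a.log' WITH (FIELDTERMINATOR = ' ', ROWTERMINATOR = '\\n')")
cur.execute("BULK INSERT log_b FROM 'C:\\logs\\b.log' WITH (FIELDTERMINATOR = ' ', ROWTERMINATOR = '\\n')")

# Create the indexes only after loading, then let the server do the join.
cur.execute("CREATE INDEX ix_log_a_id ON log_a (uniq_id)")
cur.execute("CREATE INDEX ix_log_b_id ON log_b (uniq_id)")
rows = cur.execute(
    "SELECT a.uniq_id, DATEDIFF(second, a.ts, b.ts) "
    "FROM log_a a INNER JOIN log_b b ON a.uniq_id = b.uniq_id"
).fetchall()
conn.commit()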

Upvotes: 0

rpbear

Reputation: 640

The bottleneck turned out to be the parsing of the timestamps, so I moved that step out to the individual machines on which the A logs and B logs are generated. Those machines convert the string timestamps into epoch times, and the CENTER machine, which uses all of these logs to calculate my result, now takes almost 1/20 of the time of the original approach. I'm posting my solution here; thanks to all of you guys.
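For illustration, a minimal sketch of the per-machine conversion step (the file names are placeholders, not the actual paths):

import time

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

# Rewrite each log line with the timestamp replaced by an epoch time,
# so the CENTER machine never has to call strptime itself.
with open('a.log') as src, open('a.epoch.log', 'w') as dst:
    for line in src:
        stamp, rest = line.split('] ', 1)
        epoch = int(time.mktime(time.strptime(stamp + ']', TIME_FORMAT)))
        dst.write('%d %s' % (epoch, rest))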

Upvotes: 0

Jon Clements

Reputation: 142176

You could try reversing the lookup, if "A" fits into memory, and sequentially scan "B".
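A minimal sketch of that reversal, assuming "A" is the smaller file (file names are placeholders):

# Keep the smaller A logs in memory, keyed by ID, and stream through B.
a_times = {}
with open('a.log') as a_log:
    for line in a_log:
        stamp, uniq_id, filesize = line.strip().rsplit(' ', 2)
        a_times[uniq_id] = stamp

with open('b.log') as b_log:
    for line in b_log:
        stamp, uniq_id = line.strip().rsplit(' ', 1)
        if uniq_id in a_times:
            pass  # compute the gap between stamp and a_times[uniq_id] here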

Otherwise, load the log files into a SQLite3 database with two tables (log_a, log_b) containing (timestamp, uniq_id, rest_of_line), then execute an SQL join on uniq_id, and do any processing you require on the results from that. This keeps the memory overhead low and lets the SQL engine do the join, but of course it does require effectively duplicating the log files on disk (though that's generally not an issue on most systems).

Example:

import sqlite3
from datetime import datetime

db = sqlite3.connect(':memory:')

db.execute('create table log_a (timestamp, uniq_id, filesize)')
a = ['[2012-09-12 12:23:33] SOME_UNIQ_ID filesize']
for line in a:
    timestamp, uniq_id, filesize = line.rsplit(' ', 2)
    db.execute('insert into log_a values(?, ?, ?)', (timestamp, uniq_id, filesize))
db.commit()

db.execute('create table log_b (timestamp, uniq_id)')
b = ['[2012-09-12 13:23:33] SOME_UNIQ_ID']
for line in b:
    timestamp, uniq_id = line.rsplit(' ', 1)
    db.execute('insert into log_b values(?, ?)', (timestamp, uniq_id))
db.commit()

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'
for matches in db.execute('select * from log_a join log_b using (uniq_id)'):
    log_a_ts = datetime.strptime(matches[0], TIME_FORMAT)
    log_b_ts = datetime.strptime(matches[3], TIME_FORMAT)
    print matches[1], 'has a difference of', abs(log_a_ts - log_b_ts)
    # 'SOME_UNIQ_ID has a difference of 1:00:00'
    # '1:00:00' == datetime.timedelta(0, 3600)

Note that:

  • the .connect on sqlite3 should be a filename
  • a and b should be your files

Upvotes: 3

Lukasz Madon

Reputation: 14994

First off, what is the format of the ID? Is it globally unique?

I'd choose one of these three options:

  • Use database
  • Union of two sets of ids
  • Unix tools

I assume you'd prefer the second option: load just the IDs from A and B. Assuming each ID fits into a 32-bit integer, the memory usage will be less than 1 GB. Then load the datetimes of the matching IDs and calculate the gaps. The first option, though, would be the best fit for the requirements; a sketch of the second follows.
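A minimal sketch of the second option, using a set intersection to find the IDs present in both files (the file names and the integer conversion are assumptions):

# Pass 1: collect the numeric IDs from each file and intersect them.
def read_ids(path):
    ids = set()
    with open(path) as f:
        for line in f:
            ids.add(int(line.split('] ', 1)[1].split()[0]))
    return ids

common = read_ids('a.log') & read_ids('b.log')

# Pass 2: re-read both files, keep the timestamps only for IDs in `common`,
# and calculate the gaps.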

Upvotes: 0

Hans Then

Reputation: 11322

If the unique ID can be sorted (e.g. alphabetically or numerically), you can batch the comparisons.

Suppose, for the example, that the ID is numerical with a range of 1 - 10^7. Then you could place the first 10^6 IDs in your hash table, do a sequential scan through the second file to find matching records, and repeat for each batch of 10^6.

In pseudo-Python (I have not tested this):

hash_table = {}
for i in xrange(10):                        # ten batches of 10**6 IDs each
    hash_table.clear()
    file1.seek(0)                           # rewind for each batch
    for line in file1:
        time, rest = line.split(']')
        uid = int(rest.split()[0])          # first token after the timestamp
        if i * 10**6 < uid <= (i + 1) * 10**6:
            hash_table[uid] = time

    file2.seek(0)
    for line in file2:
        time, rest = line.split(']')
        uid = int(rest.split()[0])
        if uid in hash_table:
            pass                            # compare timestamps here

If the IDs are not numerical, you can create batches using a letter key:

if id.startswith(a_letter_from_the_alphabet):
    hash_table[id] = time

Upvotes: 0

Geordee Naliyath

Reputation: 1859

Try this:

  • Externally sort both the files
  • Read the A Logs file and save SOME_UNIQ_ID (A)
  • Read the B Logs file and save SOME_UNIQ_ID (B)
  • Compare the SOME_UNIQ_ID (B) with SOME_UNIQ_ID (A)
    • If it is less, read the B Logs file again
    • If it is greater, read the A Logs file again and compare with the saved SOME_UNIQ_ID (B)
    • If it is equal, find the time gap

Assuming the external sort works efficiently, the process ends up reading both files just once; see the sketch below.
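A minimal sketch of that merge-style pass, assuming both files have already been externally sorted by the ID field and that lexicographic comparison of the IDs matches the sort order (file names are placeholders):

# Walk two ID-sorted files in lockstep, merge-join style.
def parse(line):
    # '[YYYY-mm-dd HH:MM:SS] ID ...' -> (ID, timestamp string)
    stamp, rest = line.split('] ', 1)
    return rest.split()[0], stamp + ']'

with open('a.sorted.log') as fa, open('b.sorted.log') as fb:
    line_a, line_b = fa.readline(), fb.readline()
    while line_a and line_b:
        id_a, ts_a = parse(line_a)
        id_b, ts_b = parse(line_b)
        if id_a < id_b:
            line_a = fa.readline()
        elif id_a > id_b:
            line_b = fb.readline()
        else:
            # matching IDs: compute the time gap from ts_a and ts_b here
            line_a, line_b = fa.readline(), fb.readline()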

Upvotes: 1
