Reputation: 1436
A little hesitant about posting this - as far as I'm concerned it's a genuine question, but I guess I'll understand if it's criticised or closed as an invitation for discussion...
Anyway, I need to use Python to search some quite large web logs for specific events. RegEx would be good but I'm not tied to any particular approach - I just want lines that contain two strings that could appear anywhere in a GET request.
As a typical file is over 400 MB and contains around a million lines, performance, both in terms of time to complete and load on the server (an Ubuntu/nginx VM, reasonably well spec'd and rarely overworked), is likely to be an issue.
I'm a fairly recent convert to Python (not quite a newbie, but still with plenty to learn) and I'd like a bit of guidance on the best way to achieve this.
Do I open and iterate through? Grep to a new file and then open? Some combination of the two? Something else?
Upvotes: 3
Views: 631
Reputation: 20997
As long as you don't read the whole file at once but iterate through it continuously, you should be fine. I don't think it really matters much whether you read the whole file with Python or with grep; either way you still have to load the whole file :). And if you take advantage of generators you can do this in a really programmer-friendly way:
import re

# Generator: fetch specific rows from the log file without loading it all into memory
def parse_log(filename):
    reg = re.compile(r'...')
    with open(filename, 'r') as f:
        for row in f:
            match = reg.match(row)
            if match:
                yield match.group(1)

for row in parse_log('web.log'):
    pass  # Do whatever you need with the matched row
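For the original requirement (lines containing two strings that could appear anywhere in a GET request), you may not even need a regex; a plain substring test per line works the same way. A minimal sketch, assuming hypothetical search terms 'foo' and 'bar':

# Sketch: yield log lines that contain both search terms.
# 'foo' and 'bar' are placeholders - substitute the real strings you're looking for.
def grep_two(filename, first, second):
    with open(filename, 'r') as f:
        for row in f:
            if first in row and second in row:
                yield row

for row in grep_two('web.log', 'foo', 'bar'):
    pass  # Process each matching line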
Upvotes: 2