Reputation: 1436
A little hesitant about posting this - as far as I'm concerned it's a genuine question, but I guess I'll understand if it's criticised or closed as an invitation for discussion...
Anyway, I need to use Python to search some quite large web logs for specific events. RegEx would be good but I'm not tied to any particular approach - I just want lines that contain two strings that could appear anywhere in a GET request.
As a typical file is over 400 MB and contains around a million lines, performance, both in terms of time to complete and load on the server (an Ubuntu/nginx VM, reasonably well spec'd and rarely overworked), is likely to be an issue.
I'm a fairly recent convert to Python (not quite a newbie, but still with plenty to learn) and I'd like a bit of guidance on the best way to achieve this.
Do I open and iterate through? Grep to a new file and then open? Some combination of the two? Something else?
Upvotes: 3
Views: 631
Reputation: 20997
As long as you don't read the whole file at once but iterate through it continuously, you should be fine. I don't think it really matters much whether you read the whole file with Python or with grep; either way you still have to load the whole file :). And if you take advantage of generators you can do this in a really programmer-friendly way:
import re

# Generator: fetch specific rows from the log file without loading it all into memory
def parse_log(filename):
    reg = re.compile(r'...')
    with open(filename, 'r') as f:
        for row in f:
            match = reg.match(row)
            if match:
                yield match.group(1)

for row in parse_log('web.log'):
    pass  # Do whatever you need with the matched row
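For the original requirement (lines containing two strings that could appear anywhere in a GET request), you may not even need a regex; a plain substring test per line works the same way. A minimal sketch, assuming hypothetical search terms 'foo' and 'bar':

# Sketch: yield log lines that contain both search terms.
# 'foo' and 'bar' are placeholders - substitute the real strings you're looking for.
def grep_two(filename, first, second):
    with open(filename, 'r') as f:
        for row in f:
            if first in row and second in row:
                yield row

for row in grep_two('web.log', 'foo', 'bar'):
    pass  # Process each matching line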
Upvotes: 2