Reputation: 184
I have a Python script that extracts email addresses from large documents. It uses all the RAM on my machine and locks it up to the point where I have to restart. Is there a way to limit its memory use, or to have it pause after it finishes reading each file and printing its output? Any help would be appreciated, thank you.
#!/usr/bin/env python
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io
from optparse import OptionParser
import os.path
import re

regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

import os

not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # This recursively searches all subdirectories for files
    for file in files:
        _, file_ext = os.path.splitext(file)  # Here we get the extension of the file
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # We make sure the extension is not in the banned list 'not_parseble_files'
            print("File %s is not parseble" % file_path)
            continue  # This one continues the loop to the next file
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)
Upvotes: 1
Views: 343
Reputation: 82889
It seems like you are reading files of up to 8 GB into memory with f.read(). Instead, you could try applying the regex to each line of the file, without ever having the entire file in memory.
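# get_emails_from_file is a hypothetical name; it replaces file_to_str + get_emails.
def get_emails_from_file(filename):
    # Yielding inside the with block keeps the file open while the caller iterates.
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        for line in f:
            for email in re.findall(regex, line.lower()):
                if not email[0].startswith('//'):
                    yield email[0]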
This can still take a very long time, though. Also, I did not check your regex for possible problems.
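For completeness, here is a minimal, self-contained sketch of how the line-by-line approach could be wired into a directory walk. The names iter_emails and walk_emails are made up for this example, EMAIL_RE is a deliberately simplified stand-in for the regex in the question, and errors="replace" is an assumption so that undecodable bytes do not abort the scan. Treat it as an illustration, not a drop-in replacement.

```python
import os
import re

# Simplified stand-in pattern (an assumption, NOT the question's full regex).
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def iter_emails(path):
    """Yield lowercased email addresses found in path, one line at a time."""
    # errors="replace" is an assumption: it keeps the scan going on bad bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # only the current line is held in memory
            for match in EMAIL_RE.finditer(line.lower()):
                yield match.group(0)


def walk_emails(top):
    """Yield (path, email) pairs for every regular file under top."""
    for root, _dirs, files in os.walk(top):
        for name in sorted(files):
            path = os.path.join(root, name)
            if os.path.isfile(path):
                for email in iter_emails(path):
                    yield path, email
```

Looping over walk_emails('.') and printing each email would give output similar to the original script, but memory use is bounded by the longest line rather than the largest file. If a file contains no line breaks at all, a single "line" is still the whole file, so this would not help in that case.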
Upvotes: 1