Reputation: 903
I have the following code to find all UTF-16 encoded .ini files and search their contents, but RAM usage immediately jumps from 1.3 GB to 3.9 GB, which causes my PC to crash. Which module is responsible here, and how can I do this better?
import os
import chardet
import shutil

dir_path = os.path.dirname(os.path.realpath(__file__))
string = r"\v5."

def get_encoding(filename):
    filebyte = open(filename, 'rb')
    detect_encoding = chardet.detect(filebyte.read())
    file_encoding = detect_encoding['encoding']
    filebyte.close()
    return file_encoding

for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.ini'):
            filepath = root + '/' + str(file)
            encoding = get_encoding(filepath)
            if encoding == "UTF-16":
                print(filepath)
Upvotes: 0
Views: 375
Reputation: 43870
Hmmm... I agree with @jmd_dk, it seems very strange to have .ini files that are that big. Is there some other code going on that's not posted?
So it doesn't appear that your code is memory-limited unless you have an .ini file over 1 GB. In any case, I would recommend using pathlib to make things a bit easier for you.
import chardet
from pathlib import Path

dir_path = Path(__file__).parent

for item in dir_path.rglob('*.ini'):  # recursive glob
    with item.open('rb') as filebytes:
        detected = chardet.detect(filebytes.read())
        if detected['encoding'] == 'UTF-16':
            print(item)
Upvotes: 1
Reputation: 153
I'd suggest stepping through in the debugger to find the particular line causing the issue, but I suspect that if your files are very large, you're reading the whole thing into memory. You could check this by measuring memory usage before the call to open(), then during processing, then after close(). You could get around the issue by streaming the file in chunks using the io module.
I'd also suggest using the with syntax rather than explicitly calling open() and close(); it ensures you never forget to close anything. Example:
def get_encoding(filename):
    file_encoding = ""
    with open(filename, 'rb') as filebyte:
        detect_encoding = chardet.detect(filebyte.read())
        file_encoding = detect_encoding['encoding']
    return file_encoding
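If you want to see where the memory is going, here's a minimal sketch using the standard-library tracemalloc module (the .ini path is a placeholder):

import tracemalloc

tracemalloc.start()
encoding = get_encoding('example.ini')  # hypothetical path
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024**2:.1f} MiB, peak: {peak / 1024**2:.1f} MiB")
tracemalloc.stop()

If the peak is roughly the size of the file, the filebyte.read() call is your culprit.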
Upvotes: 0
Reputation: 149
I completely agree with jmd_dk, but you could also incrementally detect the encoding as described here: Incremental Encoding Detection. This lets you feed the detector characters or lines until it reaches a specific level of certainty, which helps if you're worried about reading too many or too few bytes. You could do something like: while the certainty is below 0.90, read up to another 1000 bytes; otherwise break. A sketch follows below.
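A minimal sketch of that idea using chardet's UniversalDetector, which applies its confidence threshold internally and sets .done once it is sure (the 1000-byte chunk size is just the example value from above):

from chardet.universaldetector import UniversalDetector

def get_encoding(filename, chunk_size=1000):
    # Feed the detector a chunk at a time; it sets .done as soon
    # as it is confident, so we stop reading early for big files.
    detector = UniversalDetector()
    with open(filename, 'rb') as f:
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
    detector.close()
    return detector.result['encoding']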
Upvotes: -1
Reputation: 13120
The problem could be that filebyte.read() reads the entire content of the file into memory, which may be large, although it does seem weird to have .ini files several GB in size. Try supplying a number to filebyte.read() so that it reads at most that many bytes:
detect_encoding = chardet.detect(filebyte.read(1000))
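Plugging that into the helper from the question (with the with block suggested in another answer), a sketch:

import chardet

def get_encoding(filename, max_bytes=1000):
    # Detect from only the first max_bytes instead of the whole file;
    # chardet usually needs far less than that to make a guess.
    with open(filename, 'rb') as filebyte:
        detect_encoding = chardet.detect(filebyte.read(max_bytes))
    return detect_encoding['encoding']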
Upvotes: 2