Reputation: 903
I have the following code to find all UTF-16 encoded .ini files and search their contents, but RAM usage immediately jumps from 1.3 GB to 3.9 GB, which causes my PC to crash. Which module is responsible here, and how can I do this better?
import os
import chardet
import shutil

dir_path = os.path.dirname(os.path.realpath(__file__))
string = r"\v5."

def get_encoding(filename):
    filebyte = open(filename, 'rb')
    detect_encoding = chardet.detect(filebyte.read())
    file_encoding = detect_encoding['encoding']
    filebyte.close()
    return file_encoding

for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.ini'):
            filepath = root + '/' + str(file)
            encoding = get_encoding(filepath)
            if encoding == "UTF-16":
                print(filepath)
Upvotes: 0
Views: 375
Reputation: 43870
Hmmm... I agree with @jmd_dk, it seems very strange to have .ini files that are that big. Is there some other code going on that's not posted?
So it doesn't appear that your code is memory-limited unless you have an .ini file over 1 GB. In any case, I would recommend using pathlib to make things a bit easier for you.
import chardet
from pathlib import Path

dir_path = Path(__file__).parent

for item in dir_path.rglob('*.ini'):  # recursive glob
    with item.open('rb') as filebytes:
        detected = chardet.detect(filebytes.read())
        if detected['encoding'] == 'UTF-16':
            print(item)
Upvotes: 1
Reputation: 153
I'd suggest stepping through in the debugger to find the particular line causing the issue, but I suspect that if your files are very large, you're reading the whole thing into memory. You could check this by measuring memory usage before the call to open(), then during processing, then after close(). You could get around the issue by streaming the file in chunks using the io module.
I'd also suggest using the with syntax rather than explicitly calling open() and close(); it ensures you never forget to close anything. Example:
def get_encoding(filename):
    file_encoding = ""
    with open(filename, 'rb') as filebyte:
        detect_encoding = chardet.detect(filebyte.read())
        file_encoding = detect_encoding['encoding']
    return file_encoding
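If you want to see where the memory is going, here's a minimal sketch using the standard-library tracemalloc module (the .ini path is a placeholder):

import tracemalloc

tracemalloc.start()
encoding = get_encoding('example.ini')  # hypothetical path
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024**2:.1f} MiB, peak: {peak / 1024**2:.1f} MiB")
tracemalloc.stop()

If the peak is roughly the size of the file, the filebyte.read() call is your culprit.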
Upvotes: 0
Reputation: 149
I completely agree with jmd_dk, but you could also incrementally detect the encoding as described here: Incremental Encoding Detection. This lets you feed the detector characters or lines until it reaches a specific level of certainty, which helps if you're worried about reading too many or too few bytes. You could do something like: while the certainty is below 0.90, read up to another 1000 bytes; otherwise break. A sketch follows below.
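A minimal sketch of that idea using chardet's UniversalDetector, which applies its confidence threshold internally and sets .done once it is sure (the 1000-byte chunk size is just the example value from above):

from chardet.universaldetector import UniversalDetector

def get_encoding(filename, chunk_size=1000):
    # Feed the detector a chunk at a time; it sets .done as soon
    # as it is confident, so we stop reading early for big files.
    detector = UniversalDetector()
    with open(filename, 'rb') as f:
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
    detector.close()
    return detector.result['encoding']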
Upvotes: -1
Reputation: 13120
The problem could be that filebyte.read() reads the entire content of the file into memory, which may be large, although it does seem weird to have .ini files several GB in size. Try supplying a number to filebyte.read() so that it reads at most that many bytes:
detect_encoding = chardet.detect(filebyte.read(1000))
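Plugging that into the helper from the question (with the with block suggested in another answer), a sketch:

import chardet

def get_encoding(filename, max_bytes=1000):
    # Detect from only the first max_bytes instead of the whole file;
    # chardet usually needs far less than that to make a guess.
    with open(filename, 'rb') as filebyte:
        detect_encoding = chardet.detect(filebyte.read(max_bytes))
    return detect_encoding['encoding']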
Upvotes: 2