Abdelouahed Abbad

Reputation: 9

Optimizing Python code that reads a huge file, splits it on a separator, and applies format-preserving encryption to each string

I'm working on a project where I do format-preserving encryption (FPE, covering three types: alphabetic, alphanumeric, and numeric). To achieve this I wrote several methods, then a method that takes a file and a separator as input: I split the text in the file on the separator using the string method "split()", call an encrypt method on each string (which in turn calls a number of other methods to perform the FPE encryption), and then open another file and write the resulting encrypted text to it.

The problem is performance: when I tested on a text file of 1 million lines, the encryption took 18 minutes. I did some optimization, for example using list comprehensions instead of for loops because they are faster, and avoiding operations on strings because they are costly; the result was 8 minutes, which is a good improvement but not enough.

I then wanted to use Numba, but the problem is that my methods are inside a class and @jit doesn't work properly there (there are some objects it doesn't know about). Next I tried PyPy, and the improvement was impressive: the same file took 2 min 10 s. But that is still too long, because a file with 10 million lines takes 28 minutes under PyPy. What can I do to get more speed?

part of the code:

    def tokenize_text(self, text, separator):
        # Encrypt each separator-delimited field, then reassemble the text.
        encrypted = []
        for string in text.split(separator):
            encrypted.append(self.encrypt(string))
        return separator.join(encrypted)

    def tokenize_file(self, file, separator, output_file=None):
        # Read the whole input file, encrypt it, and write the result to
        # <name>_tokenized.<ext> unless an explicit output path is given.
        with open(file, 'r', encoding='utf-8') as f1:
            text = f1.read()
        if output_file is None:
            base, ext = file.rsplit('.', 1)
            output_file = f"{base}_tokenized.{ext}"
        with open(output_file, 'w', encoding='utf-8') as f2:
            f2.write(self.tokenize_text(text, separator))
        return output_file
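
For reference, the list-comprehension change mentioned above would look something like this for tokenize_text (the snippet above still shows the original loop form):

    def tokenize_text(self, text, separator):
        # Same behavior as above, but built with a list comprehension.
        return separator.join([self.encrypt(s) for s in text.split(separator)])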

Upvotes: 0

Views: 77

Answers (1)

Daweo

Reputation: 36700

Here

def tokenize_text(self, text, separator):
    encrypted = []
    for string in text.split(separator):
        encrypted.append(self.encrypt(string))
    return separator.join(encrypted)

you are doing repeated .append calls on a list, which according to wiki.python.org

may take surprisingly long, depending on the history of the container.

You might avoid that by doing

return separator.join(map(self.encrypt, text.split(separator)))

Please test that change and report whether, and by how much, it changed the time required.
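
For example, a quick harness to compare the two shapes with a dummy encrypt (a sketch only; real timings will be dominated by the actual FPE work):

    import timeit

    def encrypt(s):
        # Dummy stand-in for the real FPE method, just to compare loop shapes.
        return s[::-1]

    text = ','.join(str(i) for i in range(100_000))

    def with_append():
        encrypted = []
        for string in text.split(','):
            encrypted.append(encrypt(string))
        return ','.join(encrypted)

    def with_map():
        return ','.join(map(encrypt, text.split(',')))

    print('append loop:', timeit.timeit(with_append, number=10))
    print('map/join:   ', timeit.timeit(with_map, number=10))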

Upvotes: 1
