Abdelouahed Abbad

Reputation: 9

Optimizing Python code that reads a huge file, splits it on a separator, and applies format-preserving encryption to each string

I'm working on a project where I do format-preserving encryption (FPE, covering three types: alphabetic, alphanumeric, and numeric). To achieve this I wrote several methods, then a method that takes a file and a separator as input: I split the text in the file on the separator using the string method "split()", call an encrypt method on each string (which in turn calls a number of other methods to perform the FPE encryption), and then open another file and write the resulting encrypted text to it.

The problem is performance: when I tested on a text file of 1 million lines, the encryption took 18 minutes. I did some optimization, for example using list comprehensions instead of for loops because they are faster, and avoiding operations on strings because they are costly; the result was 8 minutes, which is a good improvement but not enough.

I then wanted to use Numba, but the problem is that my methods are inside a class and @jit doesn't work properly there (there are some objects it doesn't know about). Next I tried PyPy, and the improvement was impressive: the same file took 2 min 10 s. But that is still too long, because a file with 10 million lines takes 28 minutes under PyPy. What can I do to get more speed?

part of the code:

    def tokenize_text(self, text, separator):
        # Encrypt each separator-delimited field, then reassemble the text.
        encrypted = []
        for string in text.split(separator):
            encrypted.append(self.encrypt(string))
        return separator.join(encrypted)

    def tokenize_file(self, file, separator, output_file=None):
        # Read the whole input file, encrypt it, and write the result to
        # <name>_tokenized.<ext> unless an explicit output path is given.
        with open(file, 'r', encoding='utf-8') as f1:
            text = f1.read()
        if output_file is None:
            base, ext = file.rsplit('.', 1)
            output_file = f"{base}_tokenized.{ext}"
        with open(output_file, 'w', encoding='utf-8') as f2:
            f2.write(self.tokenize_text(text, separator))
        return output_file
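
For reference, the list-comprehension change mentioned above would look something like this for tokenize_text (the snippet above still shows the original loop form):

    def tokenize_text(self, text, separator):
        # Same behavior as above, but built with a list comprehension.
        return separator.join([self.encrypt(s) for s in text.split(separator)])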

Upvotes: 0

Views: 77

Answers (1)

Daweo

Reputation: 36700

Here

def tokenize_text(self, text, separator):
    encrypted = []
    for string in text.split(separator):
        encrypted.append(self.encrypt(string))
    return separator.join(encrypted)

you are doing repeated .append calls on a list, which according to wiki.python.org

may take surprisingly long, depending on the history of the container.

You might avoid that by doing

return separator.join(map(self.encrypt, text.split(separator)))

Please test that change and report whether, and by how much, it changed the time required.
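
For example, a quick harness to compare the two shapes with a dummy encrypt (a sketch only; real timings will be dominated by the actual FPE work):

    import timeit

    def encrypt(s):
        # Dummy stand-in for the real FPE method, just to compare loop shapes.
        return s[::-1]

    text = ','.join(str(i) for i in range(100_000))

    def with_append():
        encrypted = []
        for string in text.split(','):
            encrypted.append(encrypt(string))
        return ','.join(encrypted)

    def with_map():
        return ','.join(map(encrypt, text.split(',')))

    print('append loop:', timeit.timeit(with_append, number=10))
    print('map/join:   ', timeit.timeit(with_map, number=10))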

Upvotes: 1
