Reputation: 9
I'm working on a project where I do format-preserving encryption (FPE) covering three types: alphabetic, alphanumeric and numeric. To achieve this I wrote several methods, then a method that takes a file and a separator as input. I split the file's text by the separator using the string method split() and call the encrypt method on each string; encrypt in turn calls a lot of other methods to perform the FPE. Then I open another file and write the resulting encrypted text to it.
The problem is that when I tested on a text file of 1 million lines, the encryption took 18 minutes. I did some optimization: for example, I used list comprehensions instead of for loops because they are faster, and I tried to avoid operations on strings because they are expensive. The result was 8 minutes, which is a good improvement but not enough. I wanted to use Numba, but I'm using methods inside a class and @jit doesn't work properly (there are some objects it doesn't recognize). Then I tried PyPy and the improvement was impressive: for the same file I got 2 min 10 s. That is still too long, though, because when I tried a file with 10 million lines, PyPy took 28 minutes to encrypt it. What can I do to get more speed?
Part of the code:
def tokenize_text(self, text, separator):
    encrypted = []
    for string in text.split(separator):
        encrypted.append(self.encrypt(string))
    return separator.join(encrypted)

def tokenize_file(self, file, separator, output_file=None):
    with open(file, 'r', encoding='utf-8') as f1:
        text = f1.read()
    if output_file is None:
        base, ext = file.rsplit('.', 1)
        output_file = f"{base}_tokenized.{ext}"
    with open(output_file, 'w', encoding='utf-8') as f2:
        f2.write(self.tokenize_text(text, separator))
    return output_file
Upvotes: 0
Views: 77
Reputation: 36700
Here
def tokenize_text(self, text, separator):
    encrypted = []
    for string in text.split(separator):
        encrypted.append(self.encrypt(string))
    return separator.join(encrypted)
you are doing repeated .append calls on a list, which according to wiki.python.org "may take surprisingly long, depending on the history of the container."
You might avoid that by doing
return separator.join(map(self.encrypt, text.split(separator)))
Please test that change and report whether, and by how much, it changed the time required.
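One way to run that comparison is a small timeit harness like the sketch below. It is only a rough illustration: encrypt here is a hypothetical stand-in (it just reverses the string), since the real FPE routine is not shown in the question, and tokenize_append / tokenize_map are names made up for the two variants. With the real encrypt plugged in, the measured difference may well be different or negligible.

```python
import timeit


def encrypt(s):
    # Hypothetical stand-in for self.encrypt; the real FPE routine
    # is not shown in the question, so this just reverses the string.
    return s[::-1]


def tokenize_append(text, separator):
    # Original variant: explicit loop with repeated list.append calls.
    encrypted = []
    for string in text.split(separator):
        encrypted.append(encrypt(string))
    return separator.join(encrypted)


def tokenize_map(text, separator):
    # Suggested variant: feed map() straight into str.join.
    return separator.join(map(encrypt, text.split(separator)))


if __name__ == "__main__":
    # Synthetic input: 100,000 short lines (an assumption, not the asker's data).
    text = "\n".join(f"line{i}" for i in range(100_000))

    # Both variants must produce identical output before timing them.
    assert tokenize_append(text, "\n") == tokenize_map(text, "\n")

    for fn in (tokenize_append, tokenize_map):
        # Best of 3 repeats, each repeat running the function 5 times.
        seconds = min(timeit.repeat(lambda: fn(text, "\n"), number=5, repeat=3))
        print(f"{fn.__name__}: {seconds:.3f} s for 5 runs")
```

Taking the minimum over several repeats reduces noise from other processes. If both variants come out nearly equal once the real encrypt is used, the loop overhead is probably not the bottleneck and the time is being spent inside encrypt itself.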
Upvotes: 1