Reputation: 71
I am working with strings that are repeatedly concatenated to lengths of around 80 million characters. Python slows down dramatically as the string length increases.
Consider the following loop:
s = ''
for n in range(0, r):
    s += 't'
I measure the time it takes to run as 86 ms for r = 800,000, 3.11 seconds for r = 8,000,000, and 222 seconds for r = 80,000,000.
I am guessing this has something to do with how Python allocates additional memory for the string. Is there a way to speed this up, such as allocating the full 80 MB to the string s when it is declared?
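A minimal way to reproduce these timings (the timer and the list of r values here are only illustrative):

import time

for r in (800_000, 8_000_000, 80_000_000):
    start = time.perf_counter()
    s = ''
    for n in range(r):
        s += 't'
    # report how long building a string of length r took
    print(r, round(time.perf_counter() - start, 2), "seconds")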
Upvotes: 2
Views: 1623
Reputation: 110516
It can't be done in a straightforward way with text (string) objects, but it is trivial if you are dealing with bytes - in that case, you can create a bytearray object larger than your final outcome and insert your values into it.
If you need the final object as text, you can then decode it; that single step will be fast enough.
As you don't state the nature of your data, it may be harder - if no single-byte encoding can cover all the characters you need, you have to resort to a variable-length encoding, such as UTF-8, or a multibyte encoding, such as UTF-16 or UTF-32. In either case, it is no problem as long as you keep proper track of your insertion index - which will also be your final data size when decoding back to text. (If all you are using are genetic "GATACA" strings, just use ASCII encoding and you are done.)
# Pre-allocate a buffer larger than the final result.
data = bytearray(100_000_000)  # 100 million positions
index = 0
for character in input_data:
    v = character.encode("utf-8")
    if len(v) == 1:
        # Single byte: assign the integer byte value directly.
        data[index] = v[0]
    else:
        # Multi-byte character: splice the encoded bytes into the buffer.
        data[index: index + len(v)] = v
    index += len(v)

# Trim the unused tail and decode back to text in one step.
data_as_text = data[:index].decode("utf-8")
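For the pure single-byte case mentioned above (e.g. ASCII-only "GATACA"-style data), a minimal sketch could skip the per-character encoding entirely; the size r and the literal 't' below just mirror the question and are only illustrative:

r = 80_000_000

data = bytearray(r)          # pre-allocate the full size up front
for i in range(r):
    data[i] = ord('t')       # store the single ASCII byte directly

s = data.decode("ascii")     # one fast decode at the end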
Upvotes: 3
Reputation:
When you have a string value and change it in your program, the previous value remains in one part of memory and the changed string is placed in a new part of RAM.
As a result, the old values in RAM sit there unused.
The garbage collector cleans these old, unused values out of RAM, but that takes time.
You can inspect this yourself: the gc module lets you see the different generations of your objects. See this:
import gc
print(gc.get_count())
result:
(596, 2, 1)
In this example, there are 596 objects in the youngest generation, two objects in the next generation, and one object in the oldest generation. For this reason, allocation may be slow and your program may slow down.
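If collection passes are part of the overhead, one thing you can try (a sketch only; whether it helps depends on your workload) is to disable the collector around the hot loop and re-enable it afterwards:

import gc

r = 8_000_000  # illustrative size from the question

gc.disable()        # suspend automatic collection during the hot loop
try:
    s = ''
    for n in range(r):
        s += 't'
finally:
    gc.enable()     # always turn the collector back on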
See this link on efficient string concatenation in Python.
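As a rough illustration of what such guides usually recommend (building the pieces in a list and joining once is one common approach, not the only one):

r = 80_000_000  # illustrative size

# Collect the pieces first, then join them in a single pass,
# instead of re-allocating the growing string on every iteration.
parts = []
for n in range(r):
    parts.append('t')
s = ''.join(parts)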
Good luck.
Upvotes: 2