Reputation: 1096
I am getting an out-of-memory error on the last line of a piece of code that builds a huge list of lists in a for loop and then converts it to a DataFrame.
Here is a minimal reproducible example. I believe there is a lot of additional memory used in the last line. How can I make the code more memory-efficient?
import random, pandas, string

def function_to_generate_list():
    def random_string():
        return ''.join(random.choices(string.ascii_uppercase + string.digits, k=50))
    return [random_string(), random_string(), random.random()]

len = 10000*20000
df = []
for i in range(len):
    df.append(function_to_generate_list())
df = pandas.DataFrame(df, columns=['column1', 'column2', 'column3'])
Upvotes: 1
Views: 1056
Reputation: 4648
The best option is to pre-allocate the storage in the format that the DataFrame container uses internally by default (i.e. NumPy arrays). That way, the DataFrame can be created by referencing these arrays directly instead of making transformed copies of them, which reduces the memory footprint by roughly half.
import tracemalloc
import numpy as np
import pandas as pd

# provided function omitted

length = 1000000  # value used in the benchmark

tracemalloc.start()

# preallocated output
arr1 = np.zeros(length, dtype=object)
arr2 = np.zeros(length, dtype=object)
arr3 = np.zeros(length, dtype=float)

# assign directly
for i in range(length):
    arr1[i], arr2[i], arr3[i] = function_to_generate_list()

# make it a dataframe
df = pd.DataFrame(
    {'column1': arr1, 'column2': arr2, 'column3': arr3}
)

print("===== Memory Footprint =====")
# get_traced_memory() returns (current, peak) in bytes
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory usage: {peak} ({peak/1048576:.3f}M)")
k=5 and length=1000000 are used in the benchmark. Peak memory usage is reported for different methods. The benchmark was performed on a Core i5-8250U (4C8T) 64-bit laptop running Debian 10, by inserting the solution code between tracemalloc.start() and tracemalloc.get_traced_memory().
One can see that using a generator does not help, likely because an intermediate non-array object is still built before the DataFrame is produced.
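For anyone who wants to repeat the comparison, here is a rough sketch of how the three approaches could be measured side by side. The helper names (make_row, via_list, via_generator, via_preallocated, peak_bytes) are my own and not part of the original benchmark, and LENGTH is deliberately much smaller than the 1000000 used above so that it finishes quickly. Absolute numbers will differ by machine and by pandas/NumPy version.

```python
import random
import string
import tracemalloc

import numpy as np
import pandas as pd

K = 5            # string length, as in the benchmark above
LENGTH = 10_000  # assumption: far smaller than the benchmark's 1000000
COLUMNS = ['column1', 'column2', 'column3']


def make_row():
    """Stand-in for function_to_generate_list() with k=K."""
    def random_string():
        return ''.join(random.choices(string.ascii_uppercase + string.digits, k=K))
    return [random_string(), random_string(), random.random()]


def via_list():
    # Original approach: build a list of lists, then convert.
    rows = []
    for _ in range(LENGTH):
        rows.append(make_row())
    return pd.DataFrame(rows, columns=COLUMNS)


def via_generator():
    # Generator approach: hand pandas a generator expression.
    return pd.DataFrame((make_row() for _ in range(LENGTH)), columns=COLUMNS)


def via_preallocated():
    # Pre-allocated approach: fill one NumPy array per column.
    arr1 = np.zeros(LENGTH, dtype=object)
    arr2 = np.zeros(LENGTH, dtype=object)
    arr3 = np.zeros(LENGTH, dtype=float)
    for i in range(LENGTH):
        arr1[i], arr2[i], arr3[i] = make_row()
    return pd.DataFrame({'column1': arr1, 'column2': arr2, 'column3': arr3})


def peak_bytes(build):
    """Run build() under tracemalloc and return (DataFrame, peak bytes)."""
    tracemalloc.start()
    df = build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()  # also resets the traces for the next run
    return df, peak


results = {}
for build in (via_list, via_generator, via_preallocated):
    df, peak = peak_bytes(build)
    results[build.__name__] = (df.shape, peak)
    print(f"{build.__name__}: peak {peak} ({peak / 1048576:.3f}M)")
```

Note that tracemalloc only sees allocations made through Python's memory allocators, so treat the figures as a relative comparison rather than the exact process footprint.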
Upvotes: 1
Reputation: 1638
Consider using generators.
When you call df.append(function_to_generate_list()) in a loop, you keep every row in memory, but you don't actually need all the values stored at once. When you create a DataFrame, it iterates over each value only once, so you only need one value at a time. That is exactly what generators provide.
You could modify your last line like this:
df = pandas.DataFrame((function_to_generate_list() for _ in range(length)),
                      columns=['column1', 'column2', 'column3'])
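To illustrate the "one value at a time" point, here is a small sketch that is independent of pandas. The counter dict and make_row are hypothetical helpers I made up for the demonstration. It only shows that a generator expression produces rows on demand; what the consumer does with those rows afterwards (for example, whether it collects them into an intermediate list) is a separate matter.

```python
import itertools

# Hypothetical helper state: counts how many rows were actually produced.
counter = {"calls": 0}


def make_row():
    counter["calls"] += 1
    n = counter["calls"]
    return [f"row{n}", n]


# Creating the generator does not produce any rows yet.
rows = (make_row() for _ in range(1_000_000))
calls_before = counter["calls"]

# Rows are produced only as the consumer asks for them.
first_three = list(itertools.islice(rows, 3))
calls_after = counter["calls"]

print(calls_before, calls_after)
print(first_three)
```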
Also note that I renamed len to length, because creating a variable named len shadows the built-in function len.
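A quick sketch of what goes wrong when the name len is reused. demo_shadowing is a made-up function for illustration; the same thing happens at module level, where the shadowing lasts for the rest of the script.

```python
import builtins


def demo_shadowing():
    len = 3  # this name now hides the built-in len inside the function
    try:
        len([1, 2, 3])  # tries to call the integer 3
    except TypeError as exc:
        return str(exc)
    return None


message = demo_shadowing()
print(message)

# The built-in itself is untouched and still reachable through builtins.
size = builtins.len([1, 2, 3])
print(size)
```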
Upvotes: 0