user5054

Reputation: 1096

Memory issues converting a huge list of list to a data frame

I get an out-of-memory error on the last line of a piece of code that builds a huge list of lists in a for loop and then converts it to a data frame.

Here is a minimal reproducible example. I believe there is a lot of additional memory used in the last line. How can I make the code more memory-efficient?

import random, pandas, string
def function_to_generate_list():
    def random_string():
        return ''.join(random.choices(string.ascii_uppercase + string.digits, k = 50))
    return [random_string(), random_string(), random.random()]
len = 10000*20000
df = []
for i in range(len):
    df.append(function_to_generate_list())
df = pandas.DataFrame(df, columns=['column1', 'column2', 'column3'])

Upvotes: 1

Views: 1056

Answers (2)

Bill Huang

Reputation: 4648

The best option is to pre-allocate the storage in the format that a DataFrame uses internally by default, i.e. NumPy arrays (np.ndarray). The DataFrame can then be created from these arrays directly, without first building a large list of lists and converting it, which reduces the peak memory footprint by roughly half.

Solution

import tracemalloc
import numpy as np
import pandas as pd

# function_to_generate_list() is defined as in the question (omitted here)

length = 1000000  # size used in the benchmark below

tracemalloc.start()

# preallocate one array per output column
arr1 = np.zeros(length, dtype=object)
arr2 = np.zeros(length, dtype=object)
arr3 = np.zeros(length, dtype=float)

# write each generated row directly into the arrays
for i in range(length):
    arr1[i], arr2[i], arr3[i] = function_to_generate_list()

# build the DataFrame from the existing arrays
df = pd.DataFrame(
    {'column1': arr1, 'column2': arr2, 'column3': arr3}
)

print("===== Memory Footprint =====")
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory usage: {peak} ({peak/1048576:.3f}M)")

Benchmark

The benchmark uses k=5 and length=1000000 and reports the peak memory usage of each method. It was run on a 64-bit laptop with a Core i5-8250U (4C8T) running Debian 10, by placing each method's code between tracemalloc.start() and tracemalloc.get_traced_memory().

  • This solution: 148.825M <- winner
  • Generator solution: 288.355M
  • Original solution: 288.780M

One can see that using a generator does not help here, likely because an intermediate non-array object is still built from it before the DataFrame is produced.
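For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a reusable harness built on tracemalloc. The helper name peak_memory and the two toy workloads are hypothetical stand-ins for illustration, not the code that produced the numbers above.

```python
import tracemalloc


def peak_memory(fn):
    """Run fn() and return the peak number of bytes traced while it ran."""
    tracemalloc.start()
    try:
        result = fn()  # keep the result alive until the peak has been read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also clears the traces, so each call starts fresh
    del result
    return peak


def build_list(n=100_000):
    # toy workload: holds all n floats in memory at once
    return [float(i) for i in range(n)]


def sum_lazily(n=100_000):
    # toy workload: only one float is alive at a time
    return sum(float(i) for i in range(n))


list_peak = peak_memory(build_list)
lazy_peak = peak_memory(sum_lazily)
print(f"list: {list_peak / 1048576:.3f}M, lazy: {lazy_peak / 1048576:.3f}M")
```

Note that tracemalloc only sees allocations made through Python's memory allocators, so the absolute figures are best used for comparing methods against each other.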

Upvotes: 1

go2nirvana

Reputation: 1638

Consider using generators.

When you call df.append(function_to_generate_list()) in a loop, every row is stored before the DataFrame is created, but you don't actually need all the values at once. When you create a DataFrame, it iterates over each value only once, so you only need one value at a time. That is exactly what generators provide.
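To illustrate the difference, here is a small standard-library sketch. The rows function is a hypothetical stand-in for function_to_generate_list(). Whether the saving carries over to a real program depends on the consumer not collecting all the items itself.

```python
import sys


def rows(n):
    """Yield n small rows one at a time instead of building them all up front."""
    for i in range(n):
        yield [f"a{i}", f"b{i}", i / n]


n = 100_000
as_list = list(rows(n))  # every row exists in memory at the same time
as_gen = rows(n)         # nothing has been produced yet

# getsizeof reports the container only: the list's pointer array grows with n,
# while the generator object stays small regardless of n.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))

first = next(as_gen)     # rows are produced only when asked for
```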

You could modify your last line like this:

df = pandas.DataFrame((function_to_generate_list() for _ in range(length)), 
                      columns=['column1', 'column2', 'column3'])

Also note that I renamed len to length. When you create a variable named len, you shadow the built-in function len, so later calls to len() in the same scope will fail.
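A short sketch of what goes wrong when the name is reused, using only the standard library:

```python
import builtins

len = 10000 * 20000  # shadows the built-in name from here on

try:
    len([1, 2, 3])   # now tries to call an int
except TypeError as exc:
    shadow_error = str(exc)
else:
    shadow_error = None

# The built-in itself is untouched and still reachable through the builtins module.
still_available = builtins.len([1, 2, 3])

len = builtins.len      # undo the shadowing
length = 10000 * 20000  # a name that does not clash with a built-in
```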

Upvotes: 0
