pstatix

Reputation: 3848

Storing a file in lists uses 10x the memory of the file's size

I have an ASCII file that is essentially a grid of 16-bit signed integers; the file size on disk is approximately 300MB. I do not need to read the file into memory, but do need to store its contents as a single container (of containers), so for initial testing of memory use I tried lists and tuples as inner containers, with the outer container always a list built via a list comprehension:

with open(file, 'r') as f:
    for _ in range(6):
        t = next(f) # skipping some header lines
    # Method 1
    grid = [line.strip().split() for line in f] # produces a 3.3GB container
    # Method 2 (on another run)
    grid = [tuple(line.strip().split()) for line in f] # produces a 3.7GB container

After discussing how the grid is used with the team, I need to keep it as a list of lists up to a certain point, after which I will convert it to a list of tuples for program execution.

What I am curious about is how a 300MB file can have its lines stored in a container of containers and have its overall size be 10x the original raw file size. Does each container really occupy that much memory space for holding a single line each?

Upvotes: 4

Views: 155

Answers (1)

Noctis Skytower

Reputation: 21991

If you are concerned about storing data in memory and do not want to use tools outside of the standard library, you might want to take a look at the array module. It is designed to store numbers very efficiently in memory, and the array.array class accepts various type codes based on the characteristics of the numbers you want stored. The following is a simple demonstration of how you might adapt the module for your use:

#! /usr/bin/env python3
import array
import io
import pprint
import sys

CONTENT = '''\
Header 1
Header 2
Header 3
Header 4
Header 5
Header 6
 0 1 2 3 4 -5 -6 -7 -8 -9 
 -9 -8 -7 -6 -5 4 3 2 1 0 '''


def main():
    with io.StringIO(CONTENT) as file:
        for _ in range(6):
            next(file)
        grid = tuple(array.array('h', map(int, line.split())) for line in file)
    print('Grid takes up', get_size_of_grid(grid), 'bytes of memory.')
    pprint.pprint(grid)


def get_size_of_grid(grid):
    return sys.getsizeof(grid) + sum(map(sys.getsizeof, grid))


if __name__ == '__main__':
    main()
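
As for why the list-of-lists version is so much larger than the file: line.split() produces one str object per number, and in CPython each of those objects carries dozens of bytes of overhead on top of the list's own pointer to it, whereas array.array('h', ...) stores each value in two bytes. The following is only a rough sketch of that comparison; the sample row is made up for illustration, and the exact byte counts depend on your Python version and platform:

```python
import array
import sys

# Hypothetical sample row (not from the real file): 1,000 integers that
# fit in 16 bits, joined the way one line of the grid might look.
line = ' '.join(str(n * 7 - 3500) for n in range(1000))

as_strings = line.split()                           # what the question's code stores
as_array = array.array('h', map(int, as_strings))   # 'h' = signed 16-bit integer


def list_footprint(items):
    """Size of the list plus each distinct object it references."""
    # Deduplicate by identity so that any strings CPython happens to
    # share between elements are only counted once.
    distinct = {id(item): item for item in items}
    return sys.getsizeof(items) + sum(map(sys.getsizeof, distinct.values()))


strings_bytes = list_footprint(as_strings)
array_bytes = sys.getsizeof(as_array)

print('Characters in line: ', len(line))
print('List of str objects:', strings_bytes, 'bytes')
print('array of type "h":  ', array_bytes, 'bytes')
```

On a typical 64-bit CPython build, the list of strings should come out many times larger than the line's character count, while the array should come out smaller than it.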

Upvotes: 1
