Reputation: 1493
I have a 3-column file of about 28 GB. I would like to read it with Python and put its contents into a list of 3-tuples. Here's the code I'm using:
f = open(filename)
col1 = [float(l.split()[0]) for l in f]   # first full pass over the file
f.seek(0)
col2 = [float(l.split()[1]) for l in f]   # second pass
f.seek(0)
col3 = [float(l.split()[2]) for l in f]   # third pass
f.close()
rowFormat = [col1, col2, col3]
tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    ### do something with ele
There's no 'break' command in the for loop, meaning that I really do read the whole file. While the script is running, I notice from 'htop' that it takes 156 GB of virtual memory (VIRT column) and almost the same amount of resident memory (RES column). Why is my script using 156 GB when the file size is only 28 GB?
Upvotes: 2
Views: 2174
Reputation: 9796
Python objects have a lot of overhead, e.g., a reference count and a pointer to the object's type. That means a Python float is much more than 8 bytes. On my 32-bit Python version, it is
>>> import sys
>>> print(sys.getsizeof(float(0)))
16
A list has its own overhead, and then needs 4 bytes per element (on a 32-bit build) to store a reference to each object. So 100 floats in a list actually take up a size of
>>> a = map(float, range(100))  # Python 2: map returns a list
>>> sys.getsizeof(a) + sys.getsizeof(a[0]) * len(a)
2036
Now, a numpy array is different. It has a little overhead of its own, but the raw data under the hood are stored contiguously, as in C:
>>> import numpy as np
>>> b = np.array(a)
>>> sys.getsizeof(b)
848
>>> b.itemsize # number of bytes per element
8
So a Python float here takes 20 bytes (16 for the object plus a 4-byte list reference), compared to 8 for numpy, and 64-bit Python versions require even more. That is why the in-memory representation can easily end up several times larger than the file on disk.
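For comparison, on a typical 64-bit CPython the same check gives a bigger number still (the value below is from a 64-bit CPython and may vary slightly between versions):
>>> sys.getsizeof(float(0))  # on a 64-bit build
24
and every list slot there holds an 8-byte pointer, so a single float stored in a list costs roughly 24 + 8 = 32 bytes.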
So really, if you must load a lot of data into memory, numpy is one way to go. Looking at the way you load the data, I assume it's a text format with 3 floats per row, separated by an arbitrary number of spaces. In that case, you could simply use numpy.genfromtxt():
data = np.genfromtxt(fname, autostrip=True)
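If you want to be explicit about it, you can also pin down the dtype and columns up front. A minimal sketch, assuming the file really is three whitespace-separated float columns (fname is your file's path, as above):
import numpy as np

# Parse three whitespace-separated float columns into an (N, 3) float64 array.
data = np.genfromtxt(fname, dtype=np.float64, usecols=(0, 1, 2), autostrip=True)

for ele in data:   # each row is already a length-3 array of floats
    pass           # do something with ele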
You could also look into other options here, e.g., mmap, but I don't know enough about it to say whether it would be more appropriate for you.
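For what it's worth, the numpy flavour of the mmap idea is numpy.memmap: convert the text to a raw binary file once, then map that file instead of loading it. A rough sketch only, with made-up file names ('test.dat', 'data.bin') and assuming float64 triples; the one-off conversion streams line by line so it never holds the whole file in memory:
import numpy as np

# One-off conversion: append each parsed row to a raw binary file.
with open('test.dat') as src, open('data.bin', 'wb') as dst:
    for line in src:
        np.array([float(x) for x in line.split()], dtype=np.float64).tofile(dst)

# Afterwards, memory-map the binary file; the OS pages rows in from
# disk only as they are accessed, so RAM usage stays small.
mm = np.memmap('data.bin', dtype=np.float64, mode='r').reshape(-1, 3)
for ele in mm:
    pass  # do something with ele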
Upvotes: 5
Reputation: 9412
Can you get by w/o storing every tuple? I.e. can "do something" happen as you read in the file? If so... try this:
#!/usr/bin/env python
import fileinput

for line in fileinput.FileInput('test.dat'):
    # Build the 3-tuple for this line only; nothing is kept around.
    ele = tuple(float(x) for x in line.split())
    # Replace 'print' with your "do something".
    print(ele)
If not, maybe you can save some memory by choosing either the column format or the list-of-tuples format, but not BOTH. For example:
#!/usr/bin/env python
import fileinput

elements = []
for line in fileinput.FileInput('test.dat'):
    elements.append(tuple(float(x) for x in line.split()))

for ele in elements:
    pass  # do something
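Or, flipping it around: keep only the column format, but in compact typed storage. A sketch using the standard library's array module (it stores raw C doubles at 8 bytes each, much like numpy, with no per-float object overhead); untested against your data:
#!/usr/bin/env python
import fileinput
from array import array

# One typed array of C doubles per column.
col1, col2, col3 = array('d'), array('d'), array('d')
for line in fileinput.FileInput('test.dat'):
    x, y, z = (float(v) for v in line.split())
    col1.append(x)
    col2.append(y)
    col3.append(z)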
Upvotes: 0
Reputation: 4379
You need to read the file lazily, line by line, instead of making three full passes over it (a true generator version is sketched after the code). Try this:
col1 = []
col2 = []
col3 = []
rowFormat = [col1, col2, col3]

with open('test', 'r') as f:
    for line in f:
        parts = line.split()
        col1.append(float(parts[0]))
        col2.append(float(parts[1]))
        col3.append(float(parts[2]))
        # if possible, do something here to start seeing results immediately

tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    ### do something with ele
You can put your logic inside the first loop so you start seeing results before the whole file has been processed.
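The generator version mentioned above, as a sketch: one small function that yields a 3-tuple per line, so nothing is accumulated at all:
def rows(filename):
    # Yield one (col1, col2, col3) tuple per line, reading lazily.
    with open(filename) as f:
        for line in f:
            parts = line.split()
            yield (float(parts[0]), float(parts[1]), float(parts[2]))

for ele in rows('test'):
    pass  # do something with ele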
Upvotes: 0