Reputation: 21592
Trying to load a file into python. It's a very big file (1.5Gb), but I have the available memory and I just want to do this once (hence the use of python, I just need to sort the file one time so python was an easy choice).
My issue is that loading this file is resulting in way to much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700Mb, which is clearly too much. At around 50% the script hangs, using 3.03 Gb of real memory (and slowly rising).
I know this isn't the most efficient method of sorting a file (memory-wise) but I just want it to work so I can move on to more important problems :D So, what is wrong with the following python code that's causing the massive memory usage:
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
currentLine += 1.0
lined = line.split('\t')
printEvery100000 += 1
if printEvery100000 == 100000:
print str(currentLine / totalLines)
printEvery100000 = 0;
lines.append( (lined[timestamp_pos].strip(), lined[personID_pos].strip(), lined[x_pos].strip(), lined[y_pos].strip()) )
input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by 1) calling readLines(), which still loads all the data, but only has one 'string' variable overhead for each line. This loads the entire file using about 1.7Gb. Then, when I call lines.sort(), I pass a function to key that splits on tabs and returns the right column value, converted to an int. This is slow computationally, and memory-intensive overall, but it works. Learned a ton about variable allocation overhad today :D
Upvotes: 5
Views: 2096
Reputation: 37929
Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB
to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
Code:
import sys
def sizeof(lst):
return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)
GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))
val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols
line_size = sizeof(lst)
print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB
Upvotes: 3
Reputation: 887
I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You'll sort into a new file which is accessed using a memory mapping (I've been led to believe this will work efficiently [in terms of memory]). Mmap has some OS specific workings, I tested this on Linux (very small scale).
This is the basic code, to make it run with a decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert the line otherwise it will probably take a long time.
You can find a file-seeking binary search algorithm in this question.
Hopefully a memory efficient way of sorting a massive file by line:
import os
from mmap import mmap
input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')
# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()
for line in input_file:
mm.seek(0)
tup = line.split("\t")
while True:
cur_loc = mm.tell()
o_line = mm.readline()
o_tup = o_line.split("\t")
if o_line == '' or tup[0] < o_tup[0]: # EOF or we found our spot
mm.resize(cur_size + len(line))
mm[cur_loc+len(line):] = mm[cur_loc:cur_size]
mm[cur_loc:cur_loc+len(line)] = line
cur_size += len(line)
break
Upvotes: 1