Reputation: 31
I am iterating over 80 million lines in a 2.5 GB file to build a list of offsets for the start of each line. Memory grows slowly, as expected, until around line 40 million, then jumps by about 1.5 GB in 3-5 seconds, and the process exits due to lack of memory.
After some investigation, I discovered that the blow-up occurs around the time when the current offset (curr_offset) reaches about 2 billion, which happens to be around my sys.maxint (2^31 - 1).
My question: why does memory blow up at that point, and how can I avoid it?
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
Upvotes: 3
Views: 2670
Reputation: 1
Yes. Above sys.maxint, Python represents integers as arbitrary-precision longs (bignums), and those take more space than plain ints.
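A minimal sketch that makes the size difference visible on a Python 2 build (the exact byte counts vary by platform and build; the point is the jump once a value exceeds sys.maxint):
import sys

small = sys.maxint            # still a plain int
big = sys.maxint + 1          # automatically promoted to a long
print(sys.getsizeof(small))   # e.g. 12 bytes on a 32-bit build
print(sys.getsizeof(big))     # noticeably larger, e.g. 18+ bytes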
Upvotes: 0
Reputation: 94485
If you really can't use a 64-bit version of Python, you can hold your calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value 2**64 - 1). This is a little inconvenient, because you have to dynamically extend the array when it reaches capacity, but it would solve your problem and would run on any version of Python.
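A rough sketch of that grow-on-demand approach (the initial capacity and the doubling policy are my own illustration, not part of the answer):
import numpy

capacity = 1024 * 1024                        # initial guess; grows as needed
offsets = numpy.empty(capacity, dtype=numpy.uint64)
count = 0
curr_offset = 0

with open('some_file', 'rb') as f:
    for line in f:
        if count == capacity:                 # array is full: double it
            capacity *= 2
            offsets = numpy.resize(offsets, capacity)
        offsets[count] = curr_offset
        count += 1
        curr_offset += len(line)

offsets = offsets[:count]                     # trim the unused tail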
PS: A more convenient NumPy-based solution, which does not require dynamically resizing the array of offsets, is given in my other answer.
Upvotes: 0
Reputation: 94485
Here is a solution that works even with 32-bit Python versions: store the lengths of the lines (they are small), convert them into a NumPy array of 64-bit integers, and only then calculate the offsets:
import numpy

with open('some_file', 'rb') as input_file:
    lengths = map(len, input_file)  # Python 2: map() returns a list here
# Prepend a 0 and drop the last length so that cumsum() yields the offset of
# the start of each line (cumsum of the raw lengths would give the line ends).
offsets = numpy.array([0] + lengths[:-1], dtype=numpy.uint64).cumsum()
where cumsum() calculates the cumulative sum of the line lengths. 80 million lines give a manageable 8*80 = 640 MB array of offsets.
The building of the lengths list can even be bypassed by creating the array of lengths directly with numpy.fromiter():
import numpy
with open('some_file', 'rb') as input_file:
    ends = numpy.fromiter((len(line) for line in input_file),
                          dtype=numpy.uint64).cumsum()
# Same shift as above to turn line ends into line-start offsets.
offsets = numpy.concatenate((numpy.zeros(1, dtype=numpy.uint64), ends[:-1]))
I guess it would be hard to find a faster method: using a single numeric type (64-bit integers) makes NumPy arrays both more compact and faster to process than Python lists.
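A hypothetical usage sketch (assuming the offsets array built above and a file named 'some_file'): seek straight to the start of line i and read it.
with open('some_file', 'rb') as f:
    i = 12345                    # any 0-based line index
    f.seek(int(offsets[i]))      # cast the numpy.uint64 to a plain int for seek()
    print(f.readline())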
Upvotes: 2
Reputation: 96937
An offset into a 2.5 GB file should never need more than eight bytes of storage: a signed 64-bit integer goes up to 9223372036854775807, far greater than 2.5G.
If you have 80 million lines, you should require no more than 640 MB to store an array of 80M offsets.
I would investigate whether something is buggy in your code or in Python itself, perhaps by using a different container (such as an explicit NumPy array of 64-bit integers), using a preinitialized list, or even a different language altogether to store and process your offsets, such as an off_t in C with the appropriate compile flags.
(If you're curious and want to look at demo code, I wrote a program in C called sample that stores 64-bit offsets to the newlines in an input file, in order to do reservoir sampling on a scale larger than GNU sort can handle.)
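A minimal Python sketch of the "different container" idea, using the standard-library array module rather than NumPy (my own illustration, not the answer's code; the 'Q' typecode requires platform support for unsigned 64-bit integers):
from array import array

offsets = array('Q')              # unsigned 64-bit integers, 8 bytes each
curr_offset = 0
with open('some_file', 'rb') as f:
    for line in f:
        offsets.append(curr_offset)
        curr_offset += len(line)
# curr_offset itself still becomes a Python long past 2**31, but only one such
# object exists at a time, so the cost is negligible.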
Upvotes: 1
Reputation: 308206
Appending to a list reallocates the list's buffer once it outgrows the current capacity. I don't know specifically how Python does it, but a common method is to allocate 1.5x to 2x the previous buffer size; because the growth is geometric, each reallocation is larger than the last, so it's normal to see memory requirements increase rapidly near the end. It might be that the list is simply too large overall; a quick test would be to replace curr_offset += len(line) with curr_offset += 1 and see if you get the same behavior.
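The quick test spelled out (the question's loop with only the increment changed):
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += 1              # was: curr_offset += len(line)
f.close()
# If memory still blows up, the list itself is the problem; if not, the culprit
# is the promotion of the large offset values from int to long.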
Upvotes: 0
Reputation: 881685
Integers larger than sys.maxint require more memory because they're stored as longs. If your sys.maxint is only about 2 billion (2^31 - 1), you're using a 32-bit build -- download, install, and use a 64-bit build, and you'll avoid the problem. Your code looks fine!
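A quick way to check which kind of build you're running (standard-library calls only; exact output varies by platform):
import sys, platform, struct

print(sys.maxint)                   # 2147483647 on a 32-bit Python 2 build
print(struct.calcsize('P') * 8)     # pointer width in bits: 32 or 64
print(platform.architecture()[0])   # e.g. '32bit' or '64bit'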
Upvotes: 2