Reputation: 31
I am iterating over 80 million lines in a 2.5 GB file to build a list of offsets for the start of each line. Memory grows slowly, as expected, until around line 40 million, then jumps by about 1.5 GB in 3-5 seconds, and the process exits due to lack of memory.
After some investigation, I discovered that the blow-up occurs around the time when the current offset (curr_offset) reaches about 2 billion, which happens to be around my sys.maxint (2^31 - 1).
My question: why does memory blow up at that point, and how can I avoid it?
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
Upvotes: 3
Views: 2670
Reputation: 1
Yes. Above sys.maxint, Python represents integers as arbitrary-precision longs (bignums), and those take more space than plain ints.
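A minimal sketch that makes the size difference visible on a Python 2 build (the exact byte counts vary by platform and build; the point is the jump once a value exceeds sys.maxint):
import sys

small = sys.maxint            # still a plain int
big = sys.maxint + 1          # automatically promoted to a long
print(sys.getsizeof(small))   # e.g. 12 bytes on a 32-bit build
print(sys.getsizeof(big))     # noticeably larger, e.g. 18+ bytes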
Upvotes: 0
Reputation: 94485
If you really can't use a 64-bit version of Python, you can hold your calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value 2**64 - 1). This is a little inconvenient, because you have to dynamically extend the array when it reaches capacity, but it would solve your problem and would run on any version of Python.
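A rough sketch of that grow-on-demand approach (the initial capacity and the doubling policy are my own illustration, not part of the answer):
import numpy

capacity = 1024 * 1024                        # initial guess; grows as needed
offsets = numpy.empty(capacity, dtype=numpy.uint64)
count = 0
curr_offset = 0

with open('some_file', 'rb') as f:
    for line in f:
        if count == capacity:                 # array is full: double it
            capacity *= 2
            offsets = numpy.resize(offsets, capacity)
        offsets[count] = curr_offset
        count += 1
        curr_offset += len(line)

offsets = offsets[:count]                     # trim the unused tail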
PS: A more convenient NumPy-based solution, which does not require dynamically resizing the array of offsets, is given in my other answer.
Upvotes: 0
Reputation: 94485
Here is a solution that works even with 32-bit Python versions: store the lengths of the lines (they are small), convert them into a NumPy array of 64-bit integers, and only then calculate the offsets:
import numpy

with open('some_file', 'rb') as input_file:
    lengths = map(len, input_file)  # Python 2: map() returns a list here
# Prepend a 0 and drop the last length so that cumsum() yields the offset of
# the start of each line (cumsum of the raw lengths would give the line ends).
offsets = numpy.array([0] + lengths[:-1], dtype=numpy.uint64).cumsum()
where cumsum() calculates the cumulative sum of the line lengths. 80 million lines give a manageable 8*80 = 640 MB array of offsets.
The building of the lengths list can even be bypassed by creating the array of lengths directly with numpy.fromiter():
import numpy
with open('some_file', 'rb') as input_file:
    ends = numpy.fromiter((len(line) for line in input_file),
                          dtype=numpy.uint64).cumsum()
# Same shift as above to turn line ends into line-start offsets.
offsets = numpy.concatenate((numpy.zeros(1, dtype=numpy.uint64), ends[:-1]))
I guess it would be hard to find a faster method: using a single numeric type (64-bit integers) makes NumPy arrays both more compact and faster to process than Python lists.
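A hypothetical usage sketch (assuming the offsets array built above and a file named 'some_file'): seek straight to the start of line i and read it.
with open('some_file', 'rb') as f:
    i = 12345                    # any 0-based line index
    f.seek(int(offsets[i]))      # cast the numpy.uint64 to a plain int for seek()
    print(f.readline())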
Upvotes: 2
Reputation: 96937
An offset into a 2.5 GB file should never need more than eight bytes of storage: a signed 64-bit integer goes up to 9223372036854775807, far greater than 2.5G.
If you have 80 million lines, you should require no more than 640 MB to store an array of 80M offsets.
I would investigate whether something is buggy in your code or in Python itself, perhaps by using a different container (such as an explicit NumPy array of 64-bit integers), using a preinitialized list, or even a different language altogether to store and process your offsets, such as an off_t in C with the appropriate compile flags.
(If you're curious and want to look at demo code, I wrote a program in C called sample that stores 64-bit offsets to the newlines in an input file, in order to do reservoir sampling on a scale larger than GNU sort can handle.)
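A minimal Python sketch of the "different container" idea, using the standard-library array module rather than NumPy (my own illustration, not the answer's code; the 'Q' typecode requires platform support for unsigned 64-bit integers):
from array import array

offsets = array('Q')              # unsigned 64-bit integers, 8 bytes each
curr_offset = 0
with open('some_file', 'rb') as f:
    for line in f:
        offsets.append(curr_offset)
        curr_offset += len(line)
# curr_offset itself still becomes a Python long past 2**31, but only one such
# object exists at a time, so the cost is negligible.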
Upvotes: 1
Reputation: 308206
Appending to a list reallocates the list's buffer once it outgrows the current capacity. I don't know specifically how Python does it, but a common method is to allocate 1.5x to 2x the previous buffer size; because the growth is geometric, each reallocation is larger than the last, so it's normal to see memory requirements increase rapidly near the end. It might be that the list is simply too large overall; a quick test would be to replace curr_offset += len(line) with curr_offset += 1 and see if you get the same behavior.
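The quick test spelled out (the question's loop with only the increment changed):
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += 1              # was: curr_offset += len(line)
f.close()
# If memory still blows up, the list itself is the problem; if not, the culprit
# is the promotion of the large offset values from int to long.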
Upvotes: 0
Reputation: 881685
Integers larger than sys.maxint require more memory because they're stored as longs. If your sys.maxint is only about 2 billion (2^31 - 1), you're using a 32-bit build -- download, install, and use a 64-bit build, and you'll avoid the problem. Your code looks fine!
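A quick way to check which kind of build you're running (standard-library calls only; exact output varies by platform):
import sys, platform, struct

print(sys.maxint)                   # 2147483647 on a 32-bit Python 2 build
print(struct.calcsize('P') * 8)     # pointer width in bits: 32 or 64
print(platform.architecture()[0])   # e.g. '32bit' or '64bit'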
Upvotes: 2