GLHF

Reputation: 4035

Python - file processing - memory error - how to speed up performance

I'm dealing with huge amounts of numbers that I have to write into a .txt file. Right now I have to write all the numbers between 1,000,000 and 1,000,000,000 (1M-1B) into a .txt file. Since it throws a MemoryError if I do it with a single list, I sliced the range into chunks (I don't like this solution, but I couldn't find any other).

The problem is that even with the first 50M numbers (1M-50M), I can't even open the .txt file. It's 458MB and took around 15 minutes to write, so I guess the full file would be around 9GB and take over 4 hours if I write all the numbers.

When I try to open the .txt file that contains the numbers between 1M and 50M, I get:

myfile.txt has stopped working

So right now the file contains only the numbers between 1M and 50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.

I have to shuffle the numbers between 1M and 1B and store them in a .txt file. Basically it's a freelance job and I'll have to deal with bigger ranges like 100B later. If even the first 50M causes these problems, I don't know how to finish when the numbers get bigger.

Here is the code for 1M-50M:

import random

x = 1000000    # start of the current chunk
y = 10000000   # chunk size: 10M numbers at a time

# Write the numbers 1M-50M in shuffled 10M chunks, appending to one file.
while x < 50000001:
    nums = [a for a in range(x, x + y)]
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:
        for z in nums:
            f.write(str(z) + "\n")
    x += y

How can I speed up this process?

How can I open this .txt file? Should I create a new file every time? If I choose that option I have to slice the numbers into even smaller chunks, since even 50M numbers cause problems.

Is there any module you can suggest that may be useful for this process?

Upvotes: 1

Views: 293

Answers (2)

Antonín Lejsek

Reputation: 6103

I can't help you with the Python side, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125MB. Generate a random number; if it is not yet present in the bit array, add it there and write it to the file. Repeat until you have 99% of the numbers in the file.

Now convert the unused numbers in the bit array into an ordinary array - it would be about 80MB. Shuffle them and write them to the file.

You need about 200MB of memory for 1E9 items this way (and about 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20GB of RAM in less than a day.
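For reference, a minimal Python sketch of this idea follows. Treat it as an illustration only: the answer's timings are from C#, pure Python will be far slower at this scale, and the file name and constants are placeholders, so scale N down to try it out.

import random

N = 10**9        # length of the sequence; use e.g. 10**6 to test in pure Python
OFFSET = 10**6   # the range starts at 1M
CUTOFF = 0.99    # rejection-sample until 99% of the numbers are written

bits = bytearray(N // 8 + 1)   # bit array, about 125MB for 1E9 items

def test_and_set(i):
    # Return True if bit i was already set; otherwise set it and return False.
    byte, bit = divmod(i, 8)
    mask = 1 << bit
    if bits[byte] & mask:
        return True
    bits[byte] |= mask
    return False

with open('nums.txt', 'w') as f:
    written, target = 0, int(N * CUTOFF)
    # Phase 1: draw random numbers, skipping ones already seen, until 99% are written.
    while written < target:
        i = random.randrange(N)
        if not test_and_set(i):
            f.write(str(i + OFFSET) + '\n')
            written += 1
    # Phase 2: collect the remaining 1% into an ordinary list (the ~80MB above),
    # shuffle it in memory, and append it to the file.
    rest = [i for i in range(N) if not bits[i // 8] & (1 << (i % 8))]
    random.shuffle(rest)
    for i in rest:
        f.write(str(i + OFFSET) + '\n')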

Upvotes: 0

Maximilian Peters

Reputation: 31669

Is there any module you can suggest that may be useful for this process?

Using NumPy is really helpful for working with large arrays.

How can I speed up this process?

Using NumPy's arange and tofile functions dramatically speeds up the process (see the code below). Generating the initial array is about 50 times faster and writing the array to a file is about 7 times faster.

The code performs each operation only once (change number=1 to a higher value for better accuracy) and only generates the numbers between 1M and 2M, but it shows the general picture.

import random
import timeit
import numpy

x = 10**6
y = 2 * 10**6

def list_rand():
    # Plain Python: build the list with a comprehension, shuffle in place.
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    # NumPy: build the array with arange, shuffle in place.
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    # Standard file I/O: one write call per number.
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    # NumPy: write the whole array in one tofile call, newline-separated.
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')

print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))

print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))

print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))

print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))



Output:

list generation, random [secs]
    1.3995
numpy array generation, random [secs]
    0.0319
standard write [secs]
    2.5745
numpy write [secs]
    0.3622
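Applied to the original 1M-1B task, NumPy combines with the same chunking idea from the question. A rough sketch under those assumptions (like the question's code, it shuffles each 10M chunk independently rather than shuffling the whole range; the bit-array answer above gives a true full-range shuffle):

import numpy

LOW, HIGH = 10**6, 10**9   # write every number from 1M up to 1B
CHUNK = 10**7              # 10M numbers per chunk, the same slice size as the question

with open('nums.txt', 'w') as f:
    for start in range(LOW, HIGH, CHUNK):
        nums = numpy.arange(start, min(start + CHUNK, HIGH))
        numpy.random.shuffle(nums)   # shuffles within this chunk only
        nums.tofile(f, '\n')         # fast text output, newline-separated
        f.write('\n')                # tofile writes no trailing separator
        f.flush()                    # keep buffered and direct writes in order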

How can I open this .txt file? Should I create a new file every time? If I choose that option I have to slice the numbers into even smaller chunks, since even 50M numbers cause problems.

It really depends on what you are trying to do with the numbers. Find their relative positions? Delete one from the list? Restore the array?
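Whichever it is, the file never needs to be opened in an editor; it can be scanned lazily from Python in constant memory. A small illustrative sketch (the file name and the number searched for are placeholders):

with open('nums.txt') as f:
    # Iterating over the file object reads one line at a time,
    # so memory use stays flat no matter how large the file grows.
    for position, line in enumerate(f):
        if int(line) == 123456789:
            print('found at line', position)
            break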

Upvotes: 1
