Reputation: 185
For a project where I am training a neural network on chess positions, I downloaded 70 million games from database.lichess.org, extracted the position after every move of every game, and saved the won, lost, and drawn positions to separate files.
I could pretty much start training my neural network now, but the positions are clumped together by game: for example, the first 90 positions (one per halfmove of a 45-move game) would all come from the same game. This means that almost an entire training iteration would be heavily biased towards the result of a single game.
The obvious solution is to shuffle the lines of the text file, but the only way I know how to do this is like so:
import random as rand

def shuffle_lines(textfile_location):
    # Read every line of the file into a list.
    with open(textfile_location, "r") as textfile:
        textfile_lines_list = textfile.readlines()
    # Shuffle the list in place.
    rand.shuffle(textfile_lines_list)
    # "w" mode already truncates the file, so just write the shuffled lines back.
    with open(textfile_location, "w") as textfile:
        textfile.writelines(textfile_lines_list)
With the amount of data I am shuffling (~70'000'000 games * ~70 halfmoves ≈ 4'900'000'000 positions), I am worried that this will take a very long time, because I first have to copy every line from the text file into a list, then shuffle the list, then write the list back to the file.
Is there a more efficient way to do this, for example shuffling the text file without copying it into a list first?
Upvotes: 0
Views: 52
Reputation: 8101
(Edit: updated my answer to reflect @Maxijazz's comment.)
Instead of shuffling the file itself, here is an easier approach (where n is the number of lines in the file):
Use numpy.random.permutation(n). This returns an array containing a random permutation of the integers [0, 1, ..., n-1]. Reading the lines in the order given by these indices produces the same effect as shuffling the file.
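For example, here is a minimal sketch of how the permuted indices could be applied (the byte-offset index and the generator below are only an illustration, and assume a per-line offset index fits in memory):

import numpy as np

def iter_lines_shuffled(textfile_location):
    # First pass: record the byte offset at which every line starts.
    offsets = []
    with open(textfile_location, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    # numpy.random.permutation(n) -> random ordering of [0, 1, ..., n-1].
    order = np.random.permutation(len(offsets))
    # Second pass: seek to each line start in permuted order and yield the line.
    with open(textfile_location, "rb") as f:
        for i in order:
            f.seek(offsets[i])
            yield f.readline().decode()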
Upvotes: 1
Reputation: 496
I would like to suggest a different approach:
With neural networks, if your training data is biased at the beginning of training, there are usually two things you can do (see the sketch after this list):
increase the batch size (each individual game contributes less bias within a batch)
decrease or schedule the learning rate (smaller weight updates occur at the beginning)
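As a rough illustration only (the PyTorch DataLoader/optimizer setup and the concrete numbers are my own assumptions, not prescribed values):

import torch
from torch.utils.data import DataLoader

def make_training_setup(dataset, model):
    # Larger batches average the loss over many games, so no single game
    # dominates any one weight update.
    loader = DataLoader(dataset, batch_size=4096, shuffle=True)
    # A smaller learning rate keeps the early (possibly biased) updates small.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    return loader, optimizer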
Upvotes: 1