Reputation: 185
For a project where I am training a neural network on chess positions, I downloaded 70 million games from database.lichess.org, extracted the position after every move of every game, and saved the won, lost, and drawn positions to separate files.
I could pretty much start training my neural network now, but the positions are clumped together by game: for example, the first 90 positions (one per halfmove of a 45-move game) would all come from the same game. This means that almost an entire training iteration would be heavily biased towards the result of a single game.
The obvious solution is to shuffle the lines of the text file, but the only way I know how to do this is like so:
import random as rand

def shuffle_lines(textfile_location):
    # Read every line of the file into a list.
    with open(textfile_location, "r") as textfile:
        textfile_lines_list = textfile.readlines()
    # Shuffle the list in place.
    rand.shuffle(textfile_lines_list)
    # "w" mode already truncates the file, so just write the shuffled lines back.
    with open(textfile_location, "w") as textfile:
        textfile.writelines(textfile_lines_list)
With the amount of data I am shuffling (~70'000'000 games * ~70 halfmoves ≈ 4'900'000'000 positions), I am worried that this will take a very long time, because I first have to copy every line from the text file into a list, then shuffle the list, then write the list back to the file.
Is there a more efficient way to do this, for example shuffling the text file without copying it into a list first?
Upvotes: 0
Views: 52
Reputation: 8101
(Edit: updated my answer to reflect @Maxijazz's comment.)
Instead of shuffling the file itself, here is an easier approach (where n is the number of lines in the file):
Use numpy.random.permutation(n). This returns an array containing a random permutation of the integers [0, 1, ..., n-1]. Reading the lines in the order given by these indices produces the same effect as shuffling the file.
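For example, here is a minimal sketch of how the permuted indices could be applied (the byte-offset index and the generator below are only an illustration, and assume a per-line offset index fits in memory):

import numpy as np

def iter_lines_shuffled(textfile_location):
    # First pass: record the byte offset at which every line starts.
    offsets = []
    with open(textfile_location, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    # numpy.random.permutation(n) -> random ordering of [0, 1, ..., n-1].
    order = np.random.permutation(len(offsets))
    # Second pass: seek to each line start in permuted order and yield the line.
    with open(textfile_location, "rb") as f:
        for i in order:
            f.seek(offsets[i])
            yield f.readline().decode()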
Upvotes: 1
Reputation: 496
I would like to suggest a different approach:
With neural networks, if your training data is biased at the beginning of training, there are usually two things you can do (see the sketch after this list):
increase the batch size (each individual game contributes less bias within a batch)
decrease or schedule the learning rate (smaller weight updates occur at the beginning)
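As a rough illustration only (the PyTorch DataLoader/optimizer setup and the concrete numbers are my own assumptions, not prescribed values):

import torch
from torch.utils.data import DataLoader

def make_training_setup(dataset, model):
    # Larger batches average the loss over many games, so no single game
    # dominates any one weight update.
    loader = DataLoader(dataset, batch_size=4096, shuffle=True)
    # A smaller learning rate keeps the early (possibly biased) updates small.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    return loader, optimizer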
Upvotes: 1