Reputation: 11
I am trying to make a Deep Q-Network (DQN) that teaches itself to play modified versions of tic-tac-toe (an m,n,k-game).
I want to make sure the network never places a mark on a square that is already occupied.
I currently have two ideas for handling this:

1. Give the network a negative reward whenever it picks an occupied square, so the Q-values of illegal moves shrink over time.
2. Mask out the occupied squares before choosing an action, so the network can only ever select a legal move.
I'm pretty sure both would work, but which one will be more efficient during training?
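For concreteness, here is a minimal NumPy sketch of what I mean by the two options (the names `board`, `q_values`, and `ILLEGAL_PENALTY` are just placeholders, not my actual code):

```python
import numpy as np

ILLEGAL_PENALTY = -1.0  # assumed penalty for option 1; value is a placeholder

def act_option_1(board, q_values):
    """Option 1: take the greedy action as-is; if the square is occupied,
    hand back a negative reward so the network can learn to avoid it."""
    action = int(np.argmax(q_values))
    if board[action] != 0:             # square already has a mark
        return action, ILLEGAL_PENALTY
    return action, 0.0                 # legal move; the real game reward applies

def act_option_2(board, q_values):
    """Option 2: mask occupied squares before the argmax, so an illegal
    move can never be selected in the first place."""
    masked = np.where(board == 0, q_values, -np.inf)
    return int(np.argmax(masked))

# Example on a 3x3 board flattened to length 9 (0 = empty, 1/-1 = marks)
board = np.array([1, 0, 0, 0, -1, 0, 0, 0, 0])
q_values = np.random.randn(9)
print(act_option_1(board, q_values))
print(act_option_2(board, q_values))
```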
I'm currently trying option 1, but I'm not sure the Q-values for the illegal squares are actually getting smaller, and each episode seems to take too long.
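One way I could verify whether option 1 is working is to log the average Q-value over the occupied squares every few episodes; if the penalty is propagating, that number should trend downward. A hypothetical PyTorch helper, where `model` and `state` stand in for my actual network and state encoding:

```python
import torch

def mean_illegal_q(model, state, board):
    """Average Q-value over occupied squares; if option 1 is working,
    this number should trend downward as training progresses."""
    with torch.no_grad():
        q = model(state.unsqueeze(0)).squeeze(0)  # shape: (n_squares,)
    occupied = torch.as_tensor(board != 0)        # boolean mask of marked squares
    return q[occupied].mean().item()
```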
Upvotes: 1
Views: 43