Mats

Reputation: 165

Fast Evolutionary Algorithm Mutation: Mutating n random genes instead of evaluating every gene

I'm trying to optimize the code for my genetic algorithm. The DNA is currently a dictionary to increase lookup speeds for fitness calculations, but can be easily changed to a numpy array.

The mutation rate is supposed to be 1/L, L being the length of the DNA.

This solution works, but it's quite slow:

# Iterate through the genome, flipping each gene with a probability of 1/L
def mutate(self):
    self.dna = dict(
        [(i, flip(self.dna[i])) if random.randint(0, num_genes) < 1
         else (i, self.dna[i]) for i in range(num_genes)]
    )

This solution is about twice as fast, but for some reason it produces much worse results:

# Draw the number of mutations n from the 1/L rate, then flip n random genes
def mutate(self):
    num_mutations = sum(np.random.choice([0, 1], num_genes,
                                         p=[(num_genes - 1) / num_genes, 1 / num_genes]))
    for i in np.random.choice(num_genes - 1, num_mutations):
        self.dna[i] = flip(self.dna[i])

As far as I can tell they mutate the same number of genes, so the output should be the same. However, 10 runs of each with fixed random seeds show that the latter approach produces much worse fitness results.

Why does the second approach result in worse dna fitness? How do these approaches differ in their outcome?

Thank you for any help.

Upvotes: 0

Views: 370

Answers (3)

jcanedo

Reputation: 35

In the first example you potentially mutate more than num_mutations genes. While your first method mutates num_mutations genes on average, the fact that it is sometimes more can be exploited by your crossover function, which can contribute that more diverse genome to the pool. Ideally, your fitness function discards bad candidates while capitalizing on the more diverse ones that broaden the pool.
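
To see this variability concretely, here is a small standalone sketch (not the asker's code; L and the trial count are arbitrary) that counts how many genes the per-gene 1/L scheme flips on each call:

import numpy as np

# Count how many genes a per-gene 1/L mutation touches per call.
# The count follows Binomial(L, 1/L), so it varies around its mean of 1
# instead of being fixed.
L = 100
trials = 10_000
counts = (np.random.rand(trials, L) < 1.0 / L).sum(axis=1)

print(counts.mean())         # ~1.0 mutation per genome on average
print(counts.max())          # occasionally 4, 5 or more genes flip at once
print((counts == 0).mean())  # ~37% of calls mutate nothing at all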

Additionally, mutating in chunks tends to produce faster but worse results. Because the mutations are not spread as smoothly across the genome, a particular chunk can be mutated in a way that skews the pool fitnesses and gets picked disproportionately often. You also miss out on the work done by mutating other genes that have less impact on fitness. This is one of the biggest pitfalls of exponential fitness functions, if you are using one, since they amplify the discrepancy. If you spread the random choice evenly over the genome, a much fitter genome would make a more diverse contribution to the search of the solution space. There are several easy ways to get around this, such as forcing the fittest solutions to breed with other pool members without gene replacement when the population is large enough.

Upvotes: 0

Peter

Reputation: 13515

First off: the vectorization

There's no point in using a dict when your indices are integers - indexing into an array is faster than a hash-table lookup. You can also vectorize this with numpy: make self.dna a numpy array instead of a list or a dict, which may give a 10x-100x speedup.

def flip(x):  # x is a vector of genes, let's say a boolean array here
    return ~x

mutation_mask = np.random.rand(len(self.dna)) < 1. / len(self.dna)
self.dna[mutation_mask] = flip(self.dna[mutation_mask])
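
If it helps, here is a minimal sketch of how the whole mutate method could look under that assumption; the Individual class and its constructor are placeholders rather than the asker's actual code, and the copy() anticipates the aliasing point below:

import numpy as np

class Individual:
    # Placeholder class: assumes the DNA has already been converted from a
    # dict to a boolean numpy array, as suggested above.
    def __init__(self, num_genes):
        self.dna = np.random.rand(num_genes) < 0.5

    def mutate(self):
        # Flip each gene independently with probability 1/L, fully vectorized.
        mutation_mask = np.random.rand(len(self.dna)) < 1.0 / len(self.dna)
        new_dna = self.dna.copy()  # reassign instead of mutating in place
        new_dna[mutation_mask] = ~new_dna[mutation_mask]
        self.dna = new_dna

Reassigning a fresh array keeps the behaviour of the original dict-rebuilding version, where every call to mutate produced a new self.dna object.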

Second off: Why your two algorithms are different:

I don't know; statistically they look like they should be the same. The one thing I can think of is that in the second you're modifying self.dna in place with self.dna[i] = ..., rather than reassigning self.dna = ..., so any other part of the code that holds a reference to the old self.dna will see its copy change too in the second case. You could fix this by inserting self.dna = self.dna.copy() before the for-loop in the second example.
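
As a quick illustration of that aliasing effect, a standalone sketch (dict-based, like the question's current DNA; not the asker's actual code):

# A mutation done in place is visible through every other reference
# to the same object.
parent_dna = {0: 0, 1: 1, 2: 0}

child_dna = parent_dna          # in-place style: both names share one dict
child_dna[1] = 0
print(parent_dna[1])            # 0 -- the "parent" changed too

parent_dna = {0: 0, 1: 1, 2: 0}
child_dna = dict(parent_dna)    # copy first (or parent_dna.copy())
child_dna[1] = 0
print(parent_dna[1])            # 1 -- the parent is unaffected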

Upvotes: 1

Philippe

Reputation: 245

The slowness comes from the many calls to random: you are making one call per gene.

Your second example does the same, except that this time the calls all happen inside a single numpy call.

A way to drastically improve performance is to reduce the number of random calls. For that, you can use some math to determine beforehand how many mutations a genome will receive; the formula is the following:

P(L, n, p)  # probability of mutating a genome of size L exactly n times, with per-gene success probability p (here p is 1/L)
P(L, n, p) = binomial(L, n) * p**n * (1-p)**(L-n)

If you are not too familiar with math, here is a python function that will do that for you:

def binomial(n, k):
    # Binomial coefficient "n choose k", using only integer arithmetic.
    if 0 <= k <= n:
        ntok = ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0

def P(L, n, p):
    # Probability of exactly n mutations in a genome of length L with per-gene probability p.
    return binomial(L, n) * p**n * (1 - p)**(L - n)

And now you can pre-compute that and keep it in a list:

proba = [P(L, i, 1/L) for i in range(0, L+1)]

I would also recommend computing the partial sums (the cumulative probabilities), which makes it easier to use with a single random draw:

probaS = [sum(proba[:k]) for k in range(0, L+1)] + [1]

Now you can generate only one random number and you will directly know how many mutations you need for this genome:

r = random.random()
i = 0
while r > probaS[i]:
    i += 1

At the end of the loop, i-1 will tell you how many mutations are needed.

Now you just have to randomly select i-1 different parts of the genome and you are done! You went from L random calls to only 2 or 3 on average.
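
Putting it together, a rough sketch of what mutate could look like with this approach. It reuses P from above, uses num_genes, flip and self.dna as named in the question, and picks the distinct positions with random.sample, which is one possible reading of "select i-1 different parts"; probaS is precomputed once, not on every call:

import random

# Precompute the cumulative mutation-count probabilities once.
proba = [P(num_genes, n, 1 / num_genes) for n in range(num_genes + 1)]
probaS = [sum(proba[:k]) for k in range(num_genes + 1)] + [1]

def mutate(self):
    # One random draw decides how many mutations this genome receives.
    r = random.random()
    i = 0
    while r > probaS[i]:
        i += 1
    num_mutations = i - 1
    # Pick that many distinct positions and flip them.
    for pos in random.sample(range(num_genes), num_mutations):
        self.dna[pos] = flip(self.dna[pos])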

In a basic test I ran, the runtime for L=50 and 100,000 genomes went from 5.74s to 196ms, so about 30 times faster.

My answer is a bit technical so feel free to ask if it wasn't clear.

Upvotes: 1
