Wayne Workman

Reputation: 541

Genetic Algorithm - Unordered variable length chromosomes - Crossover strategies?

I'm working on a genetic algorithm. The chromosomes are not ordered - meaning the order in which they appear in a member does not affect that member's score. Also, the number of chromosomes is not fixed: one member might have a single chromosome, another might have over 100.

I'm working in Python and the chromosomes are stored in lists. Below is a simplified example of the structure:

member = [{"key1":"value","key2":"value"},{"key1":"value","key2":"value"},{"key1":"value","key2":"value"}]

Two example members (simplified) might be:

member1 = [{"a":1.5,"b":2.334563},{"a":769.0003413,"b":0.00023}]
member2 = [{"a":7,"b":432.993246927},{"a":99,"b":532.234},{"a":21,"b":712.2},{"a":432,"b":999.9999},{"a":932,"b":12}]

Repeated and out-of-order chromosomes are fine in this application:

member3 = [{"a":1,"b":1},{"a":2,"b":2},{"a":2,"b":2},{"a":1,"b":1}]

Members

Each chromosome in a member is a mathematical function that takes a Unix epoch timestamp as input and outputs a value. This lets me get 'the value' for any time using that member's function. The keys in the chromosomes are always the same, but the values are randomly generated during initial seeding; values range from 0 to 100 with up to 100 decimal places.

Grading system

I'm grading the functions against real time-series data I have in an SQL database. The time-series data is constantly updated with new values every 1 to 3 seconds. When I select from this data, I select rows where the epoch value is greater than the current epoch minus 5 seconds, order descending, and limit the output to 1 row. The epoch:value pair returned becomes one of the points I grade against.

I take all the points (epoch:value pairs) and use them to grade: I feed the member's function the epoch, get the member's value, subtract the real value, and take the absolute value of the difference.

It looks somewhat like this:

total = 0
for chromosome in member['chromosomes']:
    for epoc in epocs:
        thisValue = Calc(epoc, chromosome)
        total = total + abs(thisValue - getRealValue(epoc))

The function Calc takes a chromosome and an epoch value and outputs a float.
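The question doesn't show Calc itself, so purely as a placeholder to make the grading loop above runnable, here is a hypothetical stand-in (the actual function would be whatever the chromosome's keys parameterize):

```python
import math

def Calc(epoc, chromosome):
    # Hypothetical stand-in for the question's Calc: the real function
    # isn't shown, so this simply combines the chromosome's "a" and "b"
    # values with the timestamp to produce a float.
    return chromosome["a"] * math.sin(chromosome["b"] * epoc)
```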

Zero is a perfect score; the higher the score, the worse the member. I average all the members' scores and remove the members that score worse (higher) than average.

I've tried grading against a static set of data from the database, and against a dynamic window of the last 24 hours - meaning the last 24 hours is always different as time flows. I've also tried the last 4 hours, the last hour, and the last 3 days.
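The culling step described above (remove everyone scoring worse than average, where lower is better) can be sketched like this; `population` and `scores` are assumed to be parallel lists:

```python
from statistics import mean

def cull(population, scores):
    # Lower score = better. Keep members whose total error is at or
    # below the population average, drop the rest.
    avg = mean(scores)
    return [m for m, s in zip(population, scores) if s <= avg]
```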

Mutation System

I have set my mutation rate at 2%; I've tried higher rates with worse results. Only children are potentially mutated, not the existing population (I want to keep the elites). When a child is selected for mutation, the values in its chromosomes are randomly shifted (by addition or subtraction, chosen randomly) by a decimal between 0 and 1 with up to 100 decimal places. This provides the slightest of changes to the child's values - because even a very slight change drastically alters the output of a chromosome's function.
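The mutation described above can be sketched as follows; this is a minimal sketch, assuming the whole child is either mutated or not at the given rate:

```python
import random

def mutate(child, rate=0.02):
    # With probability `rate`, shift every value in every chromosome by a
    # random decimal in (-1, 1): direction and magnitude chosen randomly.
    if random.random() < rate:
        for chromosome in child["chromosomes"]:
            for key in chromosome:
                chromosome[key] += random.choice((-1, 1)) * random.random()
    return child
```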

My problem

The crossover methods I'm using right now result in premature convergence.

Crossover strategies I've tried

I've tried taking a random number of random chromosomes from each parent, and taking the first half of the first parent plus the last half of the second parent. The random-selection method looks like this:

# Number of chromosomes from parent 1.
parent1chromosomes = randomNumber(0,len(parent1['chromosomes']))

# Number of chromosomes from parent 2.
parent2chromosomes = randomNumber(0,len(parent2['chromosomes']))

child = {}
child['chromosomes'] = []

# Get parent 1 chromosomes into child.
for i in range(0,parent1chromosomes):
    child['chromosomes'].append(random.choice(parent1['chromosomes']))

# Get parent 2 chromosomes into child.
for i in range(0,parent2chromosomes):
    child['chromosomes'].append(random.choice(parent2['chromosomes']))

Note: randomNumber is a function that returns a random integer between the specified ranges.

Both attempts result in early convergence. The problem I'm trying to solve is extremely complex - I've tried population sizes from 10,000 to 1,000,000 so far.

Example performance

Here's a screenshot of a recent run. I'm plotting the best (lowest) scores and the average scores. It graphs five different populations' best and average scores over time. These particular 5 populations each have 10,000 members, use a 3-second sampling of real data, and are scored dynamically against the last hour of real data - which is why the best gets worse: the real data it's graded against changes in ways that make the best member worse. The best scores are off by thousands, which is radically inaccurate. Smaller populations converge even faster.

[screenshot: best and average scores of five populations over time]

My question

What are other ways to better handle crossover with variable-length members, where the order of the chromosomes does not matter and repeated chromosomes are allowed?

Upvotes: 4

Views: 1085

Answers (2)

Patrick Trentin

Reputation: 7342

disclaimer: this answer may be refined based on your feedback.


So, we want to fix two problems here:

  • fast convergence: the population diversity decreases too quickly
  • genetic obsolescence: the fitness function changes over time, which makes early successful individuals potential losers in the long run, and early unsuccessful individuals potential winners.

In your scenario, at each round you retain the best individuals within your population. In principle this is a good idea: one does not want to lose the best approximation of the optimal solution. However, the more individuals you keep from one round to the next (relative to the global population), the quicker the diversity within the population decreases. This is because keeping an individual alive gives its genome a higher chance to spread, and this happens at an exponential rate across multiple rounds. Thus, the fraction of individuals you keep alive from one round to the next should be either very small compared to the whole population, or zero.

Alternatively, you can compensate for fast-convergence by strengthening your mutation rate so that it results in higher diversity. On this regard, you may want to consider having two different mutation approaches:

  • strong-mutation: this changes some value in an arbitrary fashion so as to introduce (or re-introduce) into the population a gene that was not already available. There are several ways to do this: one or more new key-value pairs can be arbitrarily dropped from or introduced into the genome of a child, or the key or value of an existing key-value pair can be changed arbitrarily, similar to what you are doing now by flipping a bit.

  • weak-mutation: given the nature of your fitness function, whose score evaluation changes over time, it might be sensible to alter some numerical values by a percentage, e.g. increasing or decreasing a key-value pair by 0.00000001%. This should make it easier for your population to adjust to the flow of time, though the change rate must be chosen very carefully so as not to dominate the search or make it unstable.
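The strong-mutation idea (dropping or introducing whole key-value groups) could be sketched like this; `seed_chromosome` is a hypothetical factory, not something from the question:

```python
import random

def strong_mutate(member, seed_chromosome):
    # Structural mutation sketch: randomly drop an existing chromosome or
    # append a freshly seeded one. `seed_chromosome` is a hypothetical
    # factory that returns a new random chromosome dict.
    chromosomes = member["chromosomes"]
    if chromosomes and random.random() < 0.5:
        chromosomes.pop(random.randrange(len(chromosomes)))
    else:
        chromosomes.append(seed_chromosome())
    return member
```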

The true elephant in the room here is genetic obsolescence. Imagine your problem were discrete rather than continuous, and you had to find the best individual for fixed points in time rather than one that "evolves" along the search due to a changing fitness function. In the discrete case, one would run N separate searches with different weights for the fitness function, each starting from a completely random population and then allowed to converge to an optimal solution. Now take N very large, large enough that the time one search takes to converge overlaps with several of the next points in time you want to evaluate, and overlay the populations retained across all of your searches. What do you get? A population with a very high degree of diversity, because several of the overlapping searches have only just started!

So, to extend the discrete case to the continuous one, you must replicate the same situation: at each round, or after a small fixed number of rounds, generate a new set of random individuals, just as in the initialization step, and give them a chance to breed too. This needs to be done carefully: since the new pool is completely random, it might be completely overwhelmed by the existing population in just a few rounds. One idea is to let the new random pool of individuals improve for some rounds, in a safe haven, before being evaluated against the main pool. After some rounds, the two pools can be allowed to breed with one another and then merged into a single population, so that a new random set of individuals can be created.
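This "random immigrants" style injection could be sketched as follows; `new_member` is a hypothetical factory for a freshly seeded member, and the population is assumed sorted best-first:

```python
def inject_random(population, fraction, new_member):
    # Replace the worst `fraction` of the population (assumed sorted
    # best-first) with freshly seeded random members, as suggested above.
    # `new_member` is a hypothetical factory returning a random member.
    n_new = int(len(population) * fraction)
    if n_new:
        population[-n_new:] = [new_member() for _ in range(n_new)]
    return population
```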

Upvotes: 1

Dan D.

Reputation: 74645

Rather than:

for i in range(0,parent1chromosomes):
    child['chromosomes'].append(random.choice(parent1['chromosomes']))

Perhaps:

child['chromosomes'].extend(random.sample(parent1['chromosomes'], parent1chromosomes))

This means you can only get duplicate chromosomes in the child if a single parent already contains duplicates, or if both parents happen to contribute a copy of the same chromosome.
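A quick illustration of the difference: `random.choice` draws independently on each call, so repeats are possible, while `random.sample` draws without replacement:

```python
import random

genes = [{"a": i, "b": i} for i in range(5)]

# random.sample draws without replacement, so every pick is a distinct
# element of `genes`; repeated random.choice calls could pick the same one.
picked = random.sample(genes, 3)
assert len({id(g) for g in picked}) == 3
```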

Upvotes: 3
