Salvador Dali
Salvador Dali

Reputation: 222761

Find a period of eventually periodic sequence

Short explanation.

I have a sequence of numbers [0, 1, 4, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7]. As you see, from the 3-rd value the sequence is periodic with a period [0, 0, 1, 1, 2, 3, 7].

I am trying to automatically extract this period from this sequence. The problem is that neither I know the length of the period, nor do I know from which position the sequence becomes periodic.

Full explanation (might require some math)

I am learning combinatorial game theory and a cornerstone of this theory requires one to calculate Grundy values of a game graph. This produces infinite sequence, which in many cases becomes eventually periodic.

I found a way to efficiently calculate grundy values (it returns me a sequence). I would like to automatically extract offset and period of this sequence. I am aware that seeing a part of the sequence [1, 2, 3, 1, 2, 3] you can't be sure that [1, 2, 3] is a period (who knows may be the next number is 4, which breaks the assumption), but I am not interested in such intricacies (I assume that the sequence is enough to find the real period). Also the problem is the sequence can stop in the middle of the period: [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, ...] (the period is still 1, 2, 3).

I also need to find the smallest offset and period. For example for original sequence, the offset can be [0, 1, 4, 0, 0] and the period [1, 1, 2, 3, 7, 0, 0], but the smallest is [0, 1, 4] and [0, 0, 1, 1, 2, 3, 7].


My inefficient approach is to try every possible offset and every possible period. Construct the sequence using this data and check whether it is the same as original. I have not done any normal analysis, but it looks like it is at least quadratic in terms of time complexity.

Here is my quick python code (have not tested it properly):

def getPeriod(arr):
    min_offset, min_period, n = len(arr), len(arr), len(arr)
    best_offset, best_period = [], []
    for offset in xrange(n):
        start = arr[:offset]
        for period_len in xrange(1, (n - offset) / 2):
            period = arr[offset: offset+period_len]
            attempt = (start + period * (n / period_len + 1))[:n]

            if attempt == arr:
                if period_len < min_period:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len
                elif period_len == min_period and len(start) < min_offset:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len

    return best_offset, best_period

Which returns me what I want for my original sequence:

offset [0, 1, 4]
period [0, 0, 1, 1, 2, 3, 7]

Is there anything more efficient?

Upvotes: 12

Views: 6158

Answers (3)

Diego Orellana
Diego Orellana

Reputation: 1024

I had to do something similar once. I used brute force and some common sense, the solution is not very elegant but it works. The solution always works, but you have to set the right parameters (k,j, con) in the function.

  • The sequence is saved as a list in the variable seq.
  • k is the size of the sequence array, if you think your sequence will take long to become periodic then set this k to a big number.
  • The variable found will tell us if the array passed the periodic test with period j
  • j is the period.
  • If you expect a huge period then you must set j to a big number.
  • We test the periodicity by checking the last j+30 numbers of the sequence.
  • The bigger the period (j) the more we must check.
  • As soon as one of the test is passed we exit the function and we return the smaller period.

As you may notice the accuracy depends on the variables j and k but if you set them to very big numbers it will always be correct.

def some_sequence(s0, a, b, m):
    try:    
        seq=[s0]
        snext=s0
        findseq=True
        k=0
        while findseq:     
            snext= (a*snext+b)%m
            seq.append(snext)

#UNTIL THIS PART IS JUST TO CREATE THE SEQUENCE (seq) SO IS NOT IMPORTANT
            k=k+1
            if k>20000:
                # I IS OUR LIST INDEX
                for i in range(1,len(seq)):
                    for j in range(1,1000):
                        found =True
                        for con in range(j+30):
                          #THE TRICK IS TO START FROM BEHIND                   
                          if not (seq[-i-con]==seq[-i-j-con]):
                              found = False
                        if found:
                            minT=j
                            findseq=False
                            return minT

except:

    return None

simplified version

def get_min_period(sequence,max_period,test_numb):
    seq=sequence
    if max_period+test_numb > len(sequence):
        print("max_period+test_numb cannot be bigger than the seq length")
        return 1
    for i in range(1,len(seq)):       
        for j in range(1,max_period):
            found =True
            for con in range(j+test_numb):                                       
                if not (seq[-i-con]==seq[-i-j-con]):
                    found = False
            if found:           
                minT=j
                return minT

Where max_period is the maximun period you want to look for, and test_numb is how many numbers of the sequence you want to test, the bigger the better but you have to make max_period+test_numb < len(sequence)

Upvotes: 2

danbanica
danbanica

Reputation: 3038

Remark: If there is a period P1 with length L, then there is also a period P2, with the same length, L, such that the input sequence ends exactly with P2 (i.e. we do not have a partial period involved at the end).

Indeed, a different period of the same length can always be obtained by changing the offset. The new period will be a rotation of the initial period.

For example the following sequence has a period of length 4 and offset 3:

0 0 0 (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2

but it also has a period with the same length 4 and offset 5, without a partial period at the end:

0 0 0 1 2 (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2)


The implication is that we can find the minimum length of a period by processing the sequence in reverse order, and searching the minimum period using zero offset from the end. One possible approach is to simply use your current algorithm on the reversed list, without the need of the loop over offsets.

Now that we know the length of the desired period, we can also find its minimum offset. One possible approach is to try all various offsets (with the advantage of not needing the loop over lengths, since the length is known), however, further optimizations are possible if necessary, e.g. by advancing as much as possible when processing the list from the end, allowing the final repetition of the period (i.e. the one closest to the start of the un-reversed sequence) to be partial.

Upvotes: 7

Spektre
Spektre

Reputation: 51873

  1. I would start with constructing histogram of the values in the sequence

    So you just make a list of all numbers used in sequence (or significant part of it) and count their occurrence. This is O(n) where n is sequence size.

  2. sort the histogram ascending

    This is O(m.log(m)) where m is number of distinct values. You can also ignore low probable numbers (count<treshold) which are most likely in the offset or just irregularities further lowering m. For periodic sequences m <<< n so you can use it as a first marker if the sequence is periodic or not.

  3. find out the period

    In the histogram the counts should be around multiples of the n/period. So approximate/find GCD of the histogram counts. The problem is that you need to take into account there are irregularities present in the counts and also in the n (offset part) so you need to compute GCD approximately. for example:

    sequence  = { 1,1,2,3,3,1,2,3,3,1,2,3,3 }
    

    has ordered histogram:

    item,count
    2    3
    1    4
    3    6
    

    the GCD(6,4)=2 and GCD(6,3)=3 you should check at least +/-1 around the GCD results so the possible periods are around:

    T = ~n/2 = 13/2 = 6
    T = ~n/3 = 13/3 = 4
    

    So check T={3,4,5,6,7} just to be sure. Use always GCD between the highest counts vs. lowest counts. If the sequence has many distinct numbers you can also do a histogram of counts checking only the most common values.

    To check period validity just take any item near end or middle of the sequence (just use probable periodic area). Then look for it in close area near probable period before (or after) its occurrence. If found few times you got the right period (or its multiple)

  4. Get the exact period

    Just check the found period fractions (T/2, T/3, ...) or do a histogram on the found period and the smallest count tells you how many real periods you got encapsulated so divide by it.

  5. find offset

    When you know the period this is easy. Just scan from start take first item and see if after period is there again. If not remember position. Stop at the end or in the middle of sequence ... or on some treshold consequent successes. This is up to O(n) And the last remembered position is the last item in the offset.

[edit1] Was curious so I try to code it in C++

I simplified/skip few things (assuming at least half of the array is periodic) to test if I did not make some silly mistake in my algorithm and here the result (Works as expected):

const int p=10;         // min periods for testing
const int n=500;        // generated sequence size
int seq[n];             // generated sequence
int offset,period;      // generated properties
int i,j,k,e,t0,T;
int hval[n],hcnt[n],hs; // histogram

// generate periodic sequence
Randomize();
offset=Random(n/5);
period=5+Random(n/5);
for (i=0;i<offset+period;i++) seq[i]=Random(n);
for (i=offset,j=i+period;j<n;i++,j++) seq[j]=seq[i];
if ((offset)&&(seq[offset-1]==seq[offset-1+period])) seq[offset-1]++;

// compute histogram O(n) on last half of it
for (hs=0,i=n>>1;i<n;i++)
    {
    for (e=seq[i],j=0;j<hs;j++)
     if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
    }
// bubble sort histogram asc O(m^2)
for (e=1,j=hs;e;j--)
 for (e=0,i=1;i<j;i++)
  if (hcnt[i-1]>hcnt[i])
  { e=hval[i-1]; hval[i-1]=hval[i]; hval[i]=e;
    e=hcnt[i-1]; hcnt[i-1]=hcnt[i]; hcnt[i]=e; e=1; }
// test possible periods
for (j=0;j<hs;j++)
 if ((!j)||(hcnt[j]!=hcnt[j-1]))    // distinct counts only
  if (hcnt[j]>1)                    // more then 1 occurence
   for (T=(n>>1)/(hcnt[j]+1);T<=(n>>1)/(hcnt[j]-1);T++)
    {
    for (i=n-1,e=seq[i],i-=T,k=0;(i>=(n>>1))&&(k<p)&&(e==seq[i]);i-=T,k++);
    if ((k>=p)||(i<n>>1)) { j=hs; break; }
    }

// compute histogram O(T) on last multiple of period
for (hs=0,i=n-T;i<n;i++)
    {
    for (e=seq[i],j=0;j<hs;j++)
     if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
    }
// least count is the period multiple O(m)
for (e=hcnt[0],i=0;i<hs;i++) if (e>hcnt[i]) e=hcnt[i];
if (e) T/=e;

// check/handle error
if (T!=period)
    {
    return;
    }

// search offset size O(n)
for (t0=-1,i=0;i<n-T;i++)
 if (seq[i]!=seq[i+T]) t0=i;
t0++;

// check/handle error
if (t0!=offset)
    {
    return;
    }

Code is still not optimized. For n=10000 it takes around 5ms on mine setup. The result is in t0 (offset) and T (period). You may need to play with the treshold constants a bit

Upvotes: 6

Related Questions