Arash
Arash

Reputation: 11

calculating Lempel-Ziv (LZ) complexity (aka sequence complexity) of a binary string

I need to calculate the LZ-complexity of a binary string. The LZ-complexity is the number of differencet substrings encountered as the stream is viewed from begining to the end. As an example:

s = 1001111011000010

Marking in the different substrings the sequence complexity c(s) = 6: s = 1 / 0 / 01 / 1110 / 1100 / 0010 /

can someone guide me to find a simple solution for that? I am sure there should be some very straight-forward implementations for this well-known problem, but I have difficulty finding them. Can it be done simply done with constructing a suffix tree or something similar. If yes, exactly how? and what should I do?

anyone knows of any c/c++ source code to accomplish the task?

thanks in advance.

to clarify the construction of the tree suggested in the answers. Does the tree looks like this?

         o
       /   \
      o     o
     / \   / \
    o   o o   o
       /     /
      o     o

Upvotes: 0

Views: 6296

Answers (7)

Guillermo Valle has an implementation that produces the right answers (unlike e.g. the current Wikipedia code).

For instance,

Complexity of 0001 is 2: 0 001

Complexity of 010 is 3: 0 1 0

Upvotes: 1

J.R.
J.R.

Reputation: 1

This may be relevant for you. It is parallel implementation of an algorithm LZMP that computes the LZ-complexity in CUDA and runs on nVidia GPU.

http://www.ariel.ac.il/sites/ratsaby/Code/LZMP.zip

Upvotes: 0

rgiglio
rgiglio

Reputation: 41

This should do the trick in Python (from: Kaspar, F. Schuster, H. Easily calculable measure for the complexity of spatiotemporal patterns. Physical Review A, vol 36, n. 2, p 842.)

#!/usr/bin/python


def lz_complexity(s):
    i, k, l = 0, 1, 1
    k_max = 1
    n = len(s) - 1
    c = 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k = k + 1
            if l + k >= n - 1:
                c = c + 1
                break
        else:
            if k > k_max:
               k_max = k
            i = i + 1
            if i == l:
                c = c + 1
                l = l + k_max
                if l + 1 > n:
                    break
                else:
                    i = 0
                    k = 1
                    k_max = 1
            else:
                k = 1
    return c


def main():
    lz = lz_complexity('1001111011000010')
    assert lz == 6 
    print lz


if __name__ == '__main__':

    main()

Upvotes: -1

Deekshith
Deekshith

Reputation: 21

@Arash and @Sanchit Gupta: You might've got confused between LZ76 complexity and LZ78 complexity. The one Arash is refering to is LZ76 complexity and the other one is LZ78 complexity. You can refer to section-3 of the paper "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity".

Upvotes: 2

Sanchit Gupta
Sanchit Gupta

Reputation: 11

1 0 01 11 10 110 00 010
Complexity of sequence is 8 because the partitions are 8 not 6 - 1/0/01/11/10/110/00/010

Upvotes: 1

Taliadon
Taliadon

Reputation: 438

Below is a quick example of how to compute LZ-Complexity using a tree. For convenience - mine; not yours - this code implements a fixed-sized pre-allocated tree, and is a prime example of why void* pointers are ugly to use and difficult to maintain. Hand this code in as is, and your lecturer is likely to shoot you in the face :)

#include <stdlib.h>
#include <stdio.h>

int LZComplexity(char *p_binarySequence, int p_maxTreeNodes)
{
 void **patternTree;
 void **currentNode;
 void **nextFreeNode;
 int nodeCount;
 int sequenceIndex;
 int currentDigit;

 nodeCount = 0;
 patternTree = malloc(sizeof(void*) * (p_maxTreeNodes << 1));
 currentNode = patternTree;
 nextFreeNode = patternTree + (sizeof(void*) << 1);
 currentNode[0] = NULL;
 currentNode[1] = NULL;
 sequenceIndex = 0;

 while (p_binarySequence[sequenceIndex])
 {
  currentDigit = p_binarySequence[sequenceIndex] - 48;
  if (NULL == currentNode[currentDigit])
  {
   currentNode[currentDigit] = nextFreeNode;
   nextFreeNode[0] = NULL;
   nextFreeNode[1] = NULL;
   nextFreeNode += (sizeof(void*) << 1);
   currentNode = patternTree;
   nodeCount++;
  }
  else
  {
   currentNode = currentNode[currentDigit];
  }
  sequenceIndex++;
 }

 free(patternTree);
 return nodeCount;
}


int main(int argc, char *argv[])
{
 printf("%u\n", LZComplexity("10100101001011101011", 1000));
 return 0;
}

Upvotes: 1

Justin Morgan
Justin Morgan

Reputation: 2435

Create a binary tree where left is 0 and right is 1. For each bit, try to find the sequence in the tree. If it's there, concatenate the next bit, rinse, repeat. If it's not there, add it to the tree and continue. LZ Complexity is the total number of paths in the tree (not just # leaf nodes).

By the way, is this homework?

Upvotes: 0

Related Questions