Hurry Blob
Hurry Blob

Reputation: 140

algorithm to find duplicated byte sequences

Hello everyone who read the post. Help me please to resolve a task: On input I have an array of bytes. I need to detect duplicated sequences of bytes for the compressing duplicates. Can anyone help me to find acceptable algorithm?

Upvotes: 1

Views: 200

Answers (2)

antonpuz
antonpuz

Reputation: 3316

Usually this type of problems come with CPU/memory tradeoffs. The extreme solutions would be (1) and (2) below, you can improve from there:

  1. High CPU / low memory - iterate over all possible combination sizes and letters and find duplicates (2 for statements)
  2. Low CPU / high memory - create lookup table (hash map) of all required combinations and lengths, traverse the array and add to the table, later traverse the table and find your candidates
  3. Improve from here - want ideas from (2) with lower memory, decrease lookup table size to make more hits and handle the lower problem later. want faster lookup for sequence length, create separate lookup table per length.

Upvotes: 2

Ripi2
Ripi2

Reputation: 7198

Build a tree with branches are all type of bytes (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.

For example: Let's say you are at node AC,40,2F. This sequence in the tree means: "Byte AC was found at position xx (one of its positions stored in that node). The next byte, 40, was at position yy=xx+1 (among others). The byte 2F was at position zz=yy+1

Now you want to "compress" only sequences of some size (e.g. 5). So traverse the tree an pay attention to depths 5 or more. In the 5th-deep subnode of a node you have already stored all positions where such sequence (or greater) is found in the array. Those positions are those you are interested to store in your compressed file.

Upvotes: 1

Related Questions