Reputation: 140
Hello everyone who read the post. Help me please to resolve a task: On input I have an array of bytes. I need to detect duplicated sequences of bytes for the compressing duplicates. Can anyone help me to find acceptable algorithm?
Upvotes: 1
Views: 200
Reputation: 3316
Usually this type of problems come with CPU/memory tradeoffs. The extreme solutions would be (1) and (2) below, you can improve from there:
Upvotes: 2
Reputation: 7198
Build a tree with branches are all type of bytes (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.
For example: Let's say you are at node AC,40,2F
. This sequence in the tree means: "Byte AC was found at position xx (one of its positions stored in that node). The next byte, 40, was at position yy=xx+1
(among others). The byte 2F was at position zz=yy+1
Now you want to "compress" only sequences of some size (e.g. 5). So traverse the tree an pay attention to depths 5 or more. In the 5th-deep subnode of a node you have already stored all positions where such sequence (or greater) is found in the array. Those positions are those you are interested to store in your compressed file.
Upvotes: 1