Tilak Raj Singh
Tilak Raj Singh

Reputation: 353

Algo to find redundant data in a file

I have a binary file where one record is repeated multiple times. The file only consists of this record but may be repeated for a number of times.

I dont know the size of the record. What is the best algorithm to extract the record and know how many times it is repeated.

For example suppose I have a file with following memory representation in hex. (ignore file headers and all stuff)

3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA

so here my record is 3F 5C BA of 3 bytes and it is repeated 15 times here.

How can I get these values (sizeof the record and the number of times its repeated). Can be done using Rabin Karp but is there any other better and efficient way to do it.

Upvotes: 0

Views: 56

Answers (3)

lex82
lex82

Reputation: 11317

  1. Start with the assumption that the length l of your record is 1
  2. Check if your assumption is correct by comparing all subsequent blocks of size l. Stop as soon as you find a mismatch.
  3. If no mismatch is found, you are finished. RETURN.
  4. Search for the next occurrence of the block with length l. This gives you another candidate record length. If the next matching block starts at index i (zero based), set l = i and go to step 2.

If you know that there is always a solution, you might be able to speed up step 2 a bit. If you checked 50% of the data, you can stop.

Note: This answer assumes that you are looking for the shortest possible record. If all your bytes are for instance FF, could find a lot of other solutions than l=1 (e.g. only one big record).

Example: Start with a record of size 1, in your case 3F. Then check whether this is the complete record by checking whether all subsequent bytes are 3F as well. You can stop with the next byte because it differs. Now look for the next 3F. It occurs at index 3 (zero based). Now you know your record is at least 3 bytes long. Assume your record is 3 bytes long. Check if all subsequent three byte blocks match your record. Done!

Upvotes: 0

Pavel
Pavel

Reputation: 1

You can look at suffix trees, you can insert all suffixes of your string into a suffix tree and count the number of times a certain substring occurs, then do tree traversal and find your answer.

enter image description here

Upvotes: 0

Jim Mischel
Jim Mischel

Reputation: 134045

One possibility is to take the size of the file and factor it. For example, if the file size was 1280, then you know that the record size is one of the following:

1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,640,1280

You could then test each of those assumptions until you find a match or exhaust the possibilities.

Of course, this assumes that the file is not truncated or otherwise corrupted.

That's probably not the most efficient way to do it, but it's quick to code and could work quite fast enough for your purposes. It rather depends on how large your files are and how often you'll want to do this. Sometimes the brute force solution is the right solution, even if it's not the "best" solution.

Upvotes: 1

Related Questions