Reputation: 353
I have a binary file where one record is repeated multiple times. The file only consists of this record but may be repeated for a number of times.
I dont know the size of the record. What is the best algorithm to extract the record and know how many times it is repeated.
For example suppose I have a file with following memory representation in hex. (ignore file headers and all stuff)
3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA
so here my record is 3F 5C BA of 3 bytes and it is repeated 15 times here.
How can I get these values (sizeof the record and the number of times its repeated). Can be done using Rabin Karp but is there any other better and efficient way to do it.
Upvotes: 0
Views: 56
Reputation: 11317
l
of your record is 1l
. Stop as soon as you find a mismatch.l
. This gives you another candidate record length. If the next matching block starts at index i
(zero based), set l = i
and go to step 2.If you know that there is always a solution, you might be able to speed up step 2 a bit. If you checked 50% of the data, you can stop.
Note: This answer assumes that you are looking for the shortest possible record. If all your bytes are for instance FF, could find a lot of other solutions than l=1
(e.g. only one big record).
Example: Start with a record of size 1, in your case 3F. Then check whether this is the complete record by checking whether all subsequent bytes are 3F as well. You can stop with the next byte because it differs. Now look for the next 3F. It occurs at index 3 (zero based). Now you know your record is at least 3 bytes long. Assume your record is 3 bytes long. Check if all subsequent three byte blocks match your record. Done!
Upvotes: 0
Reputation: 1
You can look at suffix trees, you can insert all suffixes of your string into a suffix tree and count the number of times a certain substring occurs, then do tree traversal and find your answer.
Upvotes: 0
Reputation: 134045
One possibility is to take the size of the file and factor it. For example, if the file size was 1280, then you know that the record size is one of the following:
1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,640,1280
You could then test each of those assumptions until you find a match or exhaust the possibilities.
Of course, this assumes that the file is not truncated or otherwise corrupted.
That's probably not the most efficient way to do it, but it's quick to code and could work quite fast enough for your purposes. It rather depends on how large your files are and how often you'll want to do this. Sometimes the brute force solution is the right solution, even if it's not the "best" solution.
Upvotes: 1