Reputation: 745
I have the need to compare very large file-based strings of equal length for simple equality, without first calculating a hash.
I want to use the data in the string to make large, seemingly random jumps, so that I can quickly detect inequality even in strings that start and end the same way. That is, I want to jump throughout the range in some way that mostly or completely avoids hitting the same character too many times.
Since the strings are file-based and very large, I don't want my jumps to be too large because that will thrash the disk.
In my program, a string is simply a sequence of chars backed by a file, less than 2 GB in size, but rarely completely in memory at once.
Then, after trying for a while without finding a mismatch, I assume they are equal and just iterate in order.
My string class variations all share a base interface of int length() and char charAt(int index) functions, assuming Java chars, which are usually but not always ASCII.
Any ideas, Andy
Upvotes: 2
Views: 2210
Reputation: 7630
Did you try comparing checksums, like md5sum, as calculated by your operating system?
Most modern OSs have utilities for calculating checksums of files, and they are usually very fast.
Some file systems (btrfs, ZFS, ...) store checksums of the data within each block. With such a file system, calculating the checksum of a whole very large file should not be hard.
I would like to know of such tools...
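If no OS tool is handy, the same check can be done in plain Java. A minimal sketch (MD5 via the JDK's MessageDigest; the class name, method name, and paths are mine, not part of the answer above):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumCompare {
    // Streams a file through MD5 and returns the 16-byte digest.
    static byte[] md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // A digest mismatch proves inequality; equal digests make equality overwhelmingly likely.
        boolean sameDigest = Arrays.equals(md5Of(Path.of(args[0])), md5Of(Path.of(args[1])));
        System.out.println(sameDigest ? "checksums match" : "files differ");
    }
}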
If you want to compare the raw contents directly, parallelize the work. Create a pool with one worker per available core:
ExecutorService e = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Within each thread, open both files as READ ONLY and map non-overlapping segments of the files to MappedByteBuffers:
FileChannel fc1 = new RandomAccessFile(new File("/path/to/file1"), "r").getChannel();
MappedByteBuffer mem1 = fc1.map(FileChannel.MapMode.READ_ONLY, offset, BUFFER_SIZE);
FileChannel fc2 = new RandomAccessFile(new File("/path/to/file2"), "r").getChannel();
MappedByteBuffer mem2 = fc2.map(FileChannel.MapMode.READ_ONLY, offset, BUFFER_SIZE);
Call mem1.equals(mem2) to compare the mapped contents. (Arrays.equals(mem1.array(), mem2.array()) will not work here, because a direct MappedByteBuffer is not backed by an accessible array.)
Now, instead of jumping to random bytes within the files, make the jumps to sequential offsets of the files, comparing chunks of BUFFER_SIZE bytes at a time, one chunk per task, across number_of_available_cores threads simultaneously.
Aligning BUFFER_SIZE with the block size on the disk and the page size of virtual memory should yield much of the desired speedup. The biggest slowdown of the whole comparison will come from virtual memory PAGE FAULTS, SWAPPING, and worst of all THRASHING.
See here for more information about monitoring virtual memory performance of your code on Linux. On Windows, VMMap could be of help. See also this TechNet article on the various counters available in Windows, and this article explaining how virtual memory works on Windows.
The above also means that sequential processing, rather than random jumps, produces better results, as it leads to fewer PAGE FAULTS and minimizes page THRASHING.
By holding in memory a bit vector of the chunks already verified, you know precisely how much of the files has been compared and with what certainty they are equal. Then, when the decision to compare the whole file is made, all you have to do is visit the not-yet-visited chunks.
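A minimal sketch putting these pieces together: sequential chunk offsets are handed to a fixed thread pool and each task maps and compares one segment of both files. BUFFER_SIZE, probablyEqual, and the class name are illustrative choices, not from the text above.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompare {
    static final int BUFFER_SIZE = 1 << 20; // 1 MiB per chunk; tune to disk block / page size

    static boolean probablyEqual(String path1, String path2) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (RandomAccessFile raf1 = new RandomAccessFile(path1, "r");
             RandomAccessFile raf2 = new RandomAccessFile(path2, "r")) {
            long len = raf1.length();
            if (len != raf2.length()) return false;          // different sizes: not equal
            FileChannel fc1 = raf1.getChannel();
            FileChannel fc2 = raf2.getChannel();
            List<Future<Boolean>> chunks = new ArrayList<>();
            for (long offset = 0; offset < len; offset += BUFFER_SIZE) {
                final long off = offset;
                final long size = Math.min(BUFFER_SIZE, len - off);
                // Each task maps the same non-overlapping segment of both files and compares it.
                chunks.add(pool.submit(() -> {
                    MappedByteBuffer m1 = fc1.map(FileChannel.MapMode.READ_ONLY, off, size);
                    MappedByteBuffer m2 = fc2.map(FileChannel.MapMode.READ_ONLY, off, size);
                    return m1.equals(m2);                    // content comparison of the buffers
                }));
            }
            boolean equal = true;
            for (Future<Boolean> chunk : chunks) {
                if (!chunk.get()) equal = false;             // drain all tasks before closing files
            }
            return equal;
        } finally {
            pool.shutdown();
        }
    }
}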
Upvotes: 0
Reputation: 13841
Metadata
That said: if you have any control over the files (that is, you are generating them), you should extract some metadata and make it available, like a hash or something.
Of course, if you process a file (or a block of a file) more than once, you should try to generate that metadata.
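For example, a minimal sketch of generating such metadata as a sidecar file at creation time (the .md5 naming, class name, and raw-byte storage are just illustrative assumptions):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class SidecarHash {
    // Computes an MD5 of the file and stores the raw digest next to it as "<name>.md5".
    public static void writeSidecar(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        Path sidecar = file.resolveSibling(file.getFileName() + ".md5");
        Files.write(sidecar, md.digest());
    }
}
Later comparisons can then start by reading and comparing the two sidecar files instead of the data itself.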
Hope it helps!
Upvotes: 0
Reputation: 21106
CPUs and HDDs like reading data sequentially; it's easier to cache and process.
So your basic algorithm will be:
1. Choose a CHUNK size (16 KB?)
2. Choose how many COMPARES (characters/bytes to compare per CHUNK) you want (128?); make sure CHUNK is a multiple of COMPARES
3. Sequentially read a CHUNK from file 1
4. Sequentially read a CHUNK from file 2
5. Randomly (but sequentially) compare those two chunks
6. Repeat until EOF, or the comparisons are not equal, or some other metric of satisfaction is reached
static int CHUNK = 4096 * 16;
static int COMPARES = 128;
static int CMP_STEP = CHUNK / COMPARES;
static Random RND = new Random();

static boolean AreFilesProbablyEqual(FileReader readerA, FileReader readerB) throws IOException {
    char[] buffA = new char[CHUNK];
    char[] buffB = new char[CHUNK];
    int readA = 0;
    int readB = 0;
    while (readA != -1) { // read a CHUNK at a time
        readA = readerA.read(buffA);
        readB = readerB.read(buffB);
        // note: read() may return fewer chars than CHUNK even before EOF
        if (readA != readB) return false; // size mismatch, files are not equal
        if (readA > 0) { // work through the chunk and randomly but sequentially compare
            for (int i = 0; i < readA; i = i + CMP_STEP) {
                int range = Math.min(readA - i, CMP_STEP);
                int idx = RND.nextInt(range) + i;
                if (buffA[idx] != buffB[idx]) return false;
            }
        }
    }
    return true; // they are PROBABLY equal
}
Note: this code was written in the browser and was not tested; as a result, syntax errors may be present.
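For completeness, a small usage sketch under the same assumptions (the file paths are placeholders, and AreFilesProbablyEqual is assumed to live in the same class as this main method):
// assumes: import java.io.FileReader; import java.io.IOException;
public static void main(String[] args) throws IOException {
    try (FileReader a = new FileReader("/path/to/file1");
         FileReader b = new FileReader("/path/to/file2")) {
        System.out.println(AreFilesProbablyEqual(a, b)
                ? "probably equal"
                : "definitely different");
    }
}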
Upvotes: 0
Reputation: 12596
There is probably not a simple and best single solution for this. Here's my two cents:
If you're able to do some precalculation and store data, use a space-time tradeoff as glowcoder suggested.
The standard O(n) solution would be a regular character-by-character comparison, but in this case you need something more efficient. One possible approach is to define a step length, e.g. 10, and compare every 10th character instead. The advantage of this over random sampling is that you save the cycles spent generating randomness, and you never compare the same character twice, since the positions never collide. The problem with such a solution arises when the strings share a long prefix that compares equal.
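As an illustration, a minimal sketch of the stride idea against the interface described in the question (the FileString type, class name, and STEP value are assumptions for the sketch):
// Hypothetical interface matching the description in the question.
interface FileString {
    int length();
    char charAt(int index);
}

class StrideCompare {
    static final int STEP = 10;  // compare every 10th character

    // Returns false on the first sampled mismatch; true means "no mismatch found
    // at the sampled positions", not proven equality.
    static boolean probablyEqualByStride(FileString a, FileString b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i += STEP) {
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }
}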
If the strings have large shared prefixes and suffixes, comparing random characters, as you mentioned, might speed things up. But there's the problem of reading from disk: if you cannot hold all the information in memory, you can end up doing a lot of slow reads, and if you're unlucky, a lot of seeking between different platters.
Upvotes: 0
Reputation: 82579
Build some metadata about your giant strings.
Let's say you have them split into logical pages or blocks. You pick a block size and when you load a block into memory you hash it, storing this hash in a lookup table.
When you go to compare two files, you can first compare known hashes of subsections before going to disk to get more.
This should give you a good balance of caching and removing the need for disk access, without giving you too much overhead.
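A minimal sketch of that idea, assuming fixed-size blocks, CRC32 as the block hash, and a plain array as the lookup table (all of these are my choices for illustration, not prescribed by the answer):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class BlockHashes {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB logical blocks

    // Lazily filled table: hashes[i] holds the CRC32 of block i, or null if not yet computed.
    private final RandomAccessFile file;
    private final Long[] hashes;

    public BlockHashes(RandomAccessFile file) throws IOException {
        this.file = file;
        this.hashes = new Long[(int) ((file.length() + BLOCK_SIZE - 1) / BLOCK_SIZE)];
    }

    public int blockCount() { return hashes.length; }

    // Returns the cached hash for a block, reading and hashing it only on first access.
    public long blockHash(int block) throws IOException {
        if (hashes[block] == null) {
            byte[] buf = new byte[BLOCK_SIZE];
            file.seek((long) block * BLOCK_SIZE);
            int n = file.read(buf);
            CRC32 crc = new CRC32();
            crc.update(buf, 0, Math.max(n, 0));
            hashes[block] = crc.getValue();
        }
        return hashes[block];
    }

    // Two files are definitely different as soon as any pair of block hashes differs;
    // already-computed hashes are reused on later comparisons without touching the disk.
    public static boolean blocksMayBeEqual(BlockHashes a, BlockHashes b, int block) throws IOException {
        return a.blockHash(block) == b.blockHash(block);
    }
}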
Upvotes: 3