Reputation: 745
I have the need to compare very large file-based strings of equal length for simple equality, without first calculating a hash.
I want to use the data in the string to make large, seemingly random jumps, so that I can quickly detect inequality even in strings that start and end the same way. That is, I want to jump throughout the range in some way that mostly or completely avoids hitting the same character too many times.
Since the strings are file-based and very large, I don't want my jumps to be too large because that will thrash the disk.
In my program, a string is simply a sequence of chars backed by a file, less than 2 GB in size, but rarely completely in memory at once.
Then, after trying for a while without finding a mismatch, I assume they are equal and just iterate in order.
My string class variations all share a base interface of int length() and char charAt(int index) functions, assuming Java chars, which are usually but not always ASCII.
Any ideas, Andy
Upvotes: 2
Views: 2210
Reputation: 7630
Did you try comparing checksums, like md5sum, as calculated by your operating system?
Most modern OSs have utilities for calculating checksums of files, and they are usually very fast.
Some file systems (btrfs, ZFS, ...) store checksums of the data within each block. With such a file system, calculating the checksum of a whole very large file should not be hard.
I would like to know of such tools...
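If no OS tool is handy, the same check can be done in plain Java. A minimal sketch (MD5 via the JDK's MessageDigest; the class name, method name, and paths are mine, not part of the answer above):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumCompare {
    // Streams a file through MD5 and returns the 16-byte digest.
    static byte[] md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // A digest mismatch proves inequality; equal digests make equality overwhelmingly likely.
        boolean sameDigest = Arrays.equals(md5Of(Path.of(args[0])), md5Of(Path.of(args[1])));
        System.out.println(sameDigest ? "checksums match" : "files differ");
    }
}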
If you want to compare the raw contents directly, parallelize the work. Create a pool with one worker per available core:
ExecutorService e = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Within each thread, open both files as READ ONLY and map non-overlapping segments of the files to MappedByteBuffers:
FileChannel fc1 = new RandomAccessFile(new File("/path/to/file1"), "r").getChannel();
MappedByteBuffer mem1 = fc1.map(FileChannel.MapMode.READ_ONLY, offset, BUFFER_SIZE);
FileChannel fc2 = new RandomAccessFile(new File("/path/to/file2"), "r").getChannel();
MappedByteBuffer mem2 = fc2.map(FileChannel.MapMode.READ_ONLY, offset, BUFFER_SIZE);
Call mem1.equals(mem2) to compare the mapped contents. (Arrays.equals(mem1.array(), mem2.array()) will not work here, because a direct MappedByteBuffer is not backed by an accessible array.)
Now, instead of jumping to random bytes within the files, make the jumps to sequential offsets of the files, comparing chunks of BUFFER_SIZE bytes at a time, one chunk per task, across number_of_available_cores threads simultaneously.
Aligning BUFFER_SIZE with the block size on the disk and the page size of virtual memory should yield much of the desired speedup. The biggest slowdown of the whole comparison will come from virtual memory PAGE FAULTS, SWAPPING, and worst of all THRASHING.
See here for more information about monitoring virtual memory performance of your code on Linux. On Windows, VMMap could be of help. See also this TechNet article on the various counters available in Windows, and this article explaining how virtual memory works on Windows.
The above also means that sequential processing, rather than random jumps, produces better results, as it leads to fewer PAGE FAULTS and minimizes page THRASHING.
By holding in memory a bit vector of the chunks already verified, you know precisely how much of the files has been compared and with what certainty they are equal. Then, when the decision to compare the whole file is made, all you have to do is visit the not-yet-visited chunks.
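A minimal sketch putting these pieces together: sequential chunk offsets are handed to a fixed thread pool and each task maps and compares one segment of both files. BUFFER_SIZE, probablyEqual, and the class name are illustrative choices, not from the text above.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompare {
    static final int BUFFER_SIZE = 1 << 20; // 1 MiB per chunk; tune to disk block / page size

    static boolean probablyEqual(String path1, String path2) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (RandomAccessFile raf1 = new RandomAccessFile(path1, "r");
             RandomAccessFile raf2 = new RandomAccessFile(path2, "r")) {
            long len = raf1.length();
            if (len != raf2.length()) return false;          // different sizes: not equal
            FileChannel fc1 = raf1.getChannel();
            FileChannel fc2 = raf2.getChannel();
            List<Future<Boolean>> chunks = new ArrayList<>();
            for (long offset = 0; offset < len; offset += BUFFER_SIZE) {
                final long off = offset;
                final long size = Math.min(BUFFER_SIZE, len - off);
                // Each task maps the same non-overlapping segment of both files and compares it.
                chunks.add(pool.submit(() -> {
                    MappedByteBuffer m1 = fc1.map(FileChannel.MapMode.READ_ONLY, off, size);
                    MappedByteBuffer m2 = fc2.map(FileChannel.MapMode.READ_ONLY, off, size);
                    return m1.equals(m2);                    // content comparison of the buffers
                }));
            }
            boolean equal = true;
            for (Future<Boolean> chunk : chunks) {
                if (!chunk.get()) equal = false;             // drain all tasks before closing files
            }
            return equal;
        } finally {
            pool.shutdown();
        }
    }
}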
Upvotes: 0
Reputation: 13841
Metadata
That said: if you have any control over the files (that is, you are generating them), you should extract some metadata and make it available, like a hash or something.
Of course, if you process a file (or a block of a file) more than once, you should try to generate that metadata.
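For example, a minimal sketch of generating such metadata as a sidecar file at creation time (the .md5 naming, class name, and raw-byte storage are just illustrative assumptions):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class SidecarHash {
    // Computes an MD5 of the file and stores the raw digest next to it as "<name>.md5".
    public static void writeSidecar(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        Path sidecar = file.resolveSibling(file.getFileName() + ".md5");
        Files.write(sidecar, md.digest());
    }
}
Later comparisons can then start by reading and comparing the two sidecar files instead of the data itself.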
Hope it helps!
Upvotes: 0
Reputation: 21106
CPUs and HDDs like reading data sequentially; it's easier to cache and process.
So your basic algorithm will be:
1. Choose a CHUNK size (16 KB?)
2. Choose how many COMPARES (characters/bytes to compare per CHUNK) you want (128?); make sure CHUNK is a multiple of COMPARES
3. Sequentially read a CHUNK from file 1
4. Sequentially read a CHUNK from file 2
5. Randomly (but sequentially) compare those two chunks
6. Repeat until EOF, or the comparisons are not equal, or some other metric of satisfaction is reached
static int CHUNK = 4096 * 16;
static int COMPARES = 128;
static int CMP_STEP = CHUNK / COMPARES;
static Random RND = new Random();

static boolean AreFilesProbablyEqual(FileReader readerA, FileReader readerB) throws IOException {
    char[] buffA = new char[CHUNK];
    char[] buffB = new char[CHUNK];
    int readA = 0;
    int readB = 0;
    while (readA != -1) { // read a CHUNK at a time
        readA = readerA.read(buffA);
        readB = readerB.read(buffB);
        // note: read() may return fewer chars than CHUNK even before EOF
        if (readA != readB) return false; // size mismatch, files are not equal
        if (readA > 0) { // work through the chunk and randomly but sequentially compare
            for (int i = 0; i < readA; i = i + CMP_STEP) {
                int range = Math.min(readA - i, CMP_STEP);
                int idx = RND.nextInt(range) + i;
                if (buffA[idx] != buffB[idx]) return false;
            }
        }
    }
    return true; // they are PROBABLY equal
}
Note: this code was written in the browser and was not tested; as a result, syntax errors may be present.
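For completeness, a small usage sketch under the same assumptions (the file paths are placeholders, and AreFilesProbablyEqual is assumed to live in the same class as this main method):
// assumes: import java.io.FileReader; import java.io.IOException;
public static void main(String[] args) throws IOException {
    try (FileReader a = new FileReader("/path/to/file1");
         FileReader b = new FileReader("/path/to/file2")) {
        System.out.println(AreFilesProbablyEqual(a, b)
                ? "probably equal"
                : "definitely different");
    }
}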
Upvotes: 0
Reputation: 12596
There is probably not a simple and best single solution for this. Here's my two cents:
If you're able to do some precalculation and store data, use a space-time tradeoff as glowcoder suggested.
The standard O(n) solution would be a regular character-by-character comparison, but in this case you need something more efficient. One possible approach is to define a step length, e.g. 10, and compare every 10th character instead. The advantage of this over random sampling is that you save the cycles spent generating randomness, and you never compare the same character twice, since the positions never collide. The problem with such a solution arises when the strings share a long prefix that compares equal.
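As an illustration, a minimal sketch of the stride idea against the interface described in the question (the FileString type, class name, and STEP value are assumptions for the sketch):
// Hypothetical interface matching the description in the question.
interface FileString {
    int length();
    char charAt(int index);
}

class StrideCompare {
    static final int STEP = 10;  // compare every 10th character

    // Returns false on the first sampled mismatch; true means "no mismatch found
    // at the sampled positions", not proven equality.
    static boolean probablyEqualByStride(FileString a, FileString b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i += STEP) {
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }
}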
If the strings have large shared prefixes and suffixes, comparing random characters, as you mentioned, might speed things up. But there's the problem of reading from disk: if you cannot hold all the information in memory, you can end up doing a lot of slow reads, and if you're unlucky, a lot of seeking between different platters.
Upvotes: 0
Reputation: 82579
Build some metadata about your giant strings.
Let's say you have them split into logical pages or blocks. You pick a block size and when you load a block into memory you hash it, storing this hash in a lookup table.
When you go to compare two files, you can first compare known hashes of subsections before going to disk to get more.
This should give you a good balance of caching and removing the need for disk access, without giving you too much overhead.
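A minimal sketch of that idea, assuming fixed-size blocks, CRC32 as the block hash, and a plain array as the lookup table (all of these are my choices for illustration, not prescribed by the answer):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class BlockHashes {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB logical blocks

    // Lazily filled table: hashes[i] holds the CRC32 of block i, or null if not yet computed.
    private final RandomAccessFile file;
    private final Long[] hashes;

    public BlockHashes(RandomAccessFile file) throws IOException {
        this.file = file;
        this.hashes = new Long[(int) ((file.length() + BLOCK_SIZE - 1) / BLOCK_SIZE)];
    }

    public int blockCount() { return hashes.length; }

    // Returns the cached hash for a block, reading and hashing it only on first access.
    public long blockHash(int block) throws IOException {
        if (hashes[block] == null) {
            byte[] buf = new byte[BLOCK_SIZE];
            file.seek((long) block * BLOCK_SIZE);
            int n = file.read(buf);
            CRC32 crc = new CRC32();
            crc.update(buf, 0, Math.max(n, 0));
            hashes[block] = crc.getValue();
        }
        return hashes[block];
    }

    // Two files are definitely different as soon as any pair of block hashes differs;
    // already-computed hashes are reused on later comparisons without touching the disk.
    public static boolean blocksMayBeEqual(BlockHashes a, BlockHashes b, int block) throws IOException {
        return a.blockHash(block) == b.blockHash(block);
    }
}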
Upvotes: 3