Reputation: 78
This is a rather abstract question, as I have no idea yet how to solve it and haven't found any suitable solutions.
Let's start with the current situation. You have an array of byte[] arrays (e.g. an ArrayList<byte[]>) which behind the scenes are actually Strings, but at the current stage the byte[] form is preferred. They can be very long (1024+ bytes per byte[] array, and the ArrayList may contain up to 1024 such arrays) and may have different lengths. Furthermore, they share a lot of the same bytes at the "same" locations (relatively speaking: for a = {0x41, 0x41, 0x61} and b = {0x41, 0x41, 0x42, 0x61}, the leading 0x41 bytes and the trailing 0x61 are the same).
I'm looking now for an algorithm that compares all those arrays with each other. The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
If possible, this should work without any third-party libraries (but I doubt it is feasible in a reasonable time without one).
Any suggestions are very welcome.
Edit:
Made some adjustments.
EDIT / SOLUTION:
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime/speed. This is very specific to the data I'm handling, as I know that all the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input/answers, they were a great help.
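For illustration, here is a minimal Java sketch of this idea: strip the shared prefix and suffix first, then run a standard two-row Levenshtein DP on what remains. This is my own sketch of the "filtering" approach described above, not the poster's actual code; the class and method names are made up.

```java
public class TrimmedLevenshtein {

    // Strip the shared prefix and suffix before running the O(m*n) DP.
    // When two arrays share most of their content, only the differing
    // middle section is actually compared.
    static int distance(byte[] a, byte[] b) {
        int start = 0;
        int endA = a.length, endB = b.length;
        // skip the common prefix
        while (start < endA && start < endB && a[start] == b[start]) start++;
        // skip the common suffix (without crossing into the prefix)
        while (endA > start && endB > start && a[endA - 1] == b[endB - 1]) {
            endA--;
            endB--;
        }
        return levenshtein(a, start, endA, b, start, endB);
    }

    // Classic two-row dynamic-programming Levenshtein on the sub-ranges.
    static int levenshtein(byte[] a, int aFrom, int aTo, byte[] b, int bFrom, int bTo) {
        int m = aTo - aFrom, n = bTo - bFrom;
        if (m == 0) return n;
        if (n == 0) return m;
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = (a[aFrom + i - 1] == b[bFrom + j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[n];
    }

    public static void main(String[] args) {
        byte[] a = {0x41, 0x41, 0x61};
        byte[] b = {0x41, 0x41, 0x42, 0x61};
        // one insertion (0x42) after trimming prefix/suffix -> prints 1
        System.out.println(distance(a, b));
    }
}
```

For the example pair from the question, the trimming reduces the DP to comparing an empty range against a single byte, so the quadratic part never runs on the shared content.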
Upvotes: 0
Views: 184
Reputation: 78
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime/speed. This is very specific to the data I'm handling, as I know that all the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input/answers, they were a great help.
Upvotes: 0
Reputation: 8345
The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
You will not be able to find a solution where the metric and the running time are independent; they go hand in hand.
For example: if your metric is like the example from your post, i.e. d(str1, str2) = d(str1.first, str2.first) + d(str1.last, str2.last), then the solution is very easy: sort your array by first and last character (maybe separately), and then take the first and last elements of the sorted array. That gives you O(n log n) for the sort.
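A sketch of that sorting trick, under the toy first/last-byte metric (all names here are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExtremesByMetric {
    // Toy metric: distance depends only on the first and last byte
    // of each array. Note it is blind to any differences in between.
    static int d(byte[] x, byte[] y) {
        return Math.abs(x[0] - y[0])
             + Math.abs(x[x.length - 1] - y[y.length - 1]);
    }

    // Sorting by first byte places the pair with the largest first-byte
    // gap at the two ends of the list: O(n log n) instead of comparing
    // all O(n^2) pairs (for this component of the metric).
    static byte[][] extremesByFirstByte(List<byte[]> arrays) {
        List<byte[]> sorted = new ArrayList<>(arrays);
        sorted.sort(Comparator.comparingInt((byte[] a) -> a[0]));
        return new byte[][]{ sorted.get(0), sorted.get(sorted.size() - 1) };
    }
}
```

The same sort could be repeated on the last byte; the point is only that a metric with this structure lets you avoid pairwise comparison entirely.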
But if your metric is something like "two sentences are close if they contain many equal words", then this does not work at all, and you end up with O(n²). Or you may be able to come up with a nifty way to re-order the words within the sentences before sorting the sentences, etc.
So unless you have a known metric with exploitable structure, it's O(n²) with the trivial (naive) implementation: compare every pair while keeping track of the maximum delta.
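That naive baseline might look like the following sketch, with the metric passed in as a function so any distance (e.g. Levenshtein) can be plugged in; the names are mine, not from the answer:

```java
import java.util.List;
import java.util.function.ToIntBiFunction;

public class MostDifferent {
    // Naive O(n^2) scan: evaluate the metric on every pair and remember
    // the indices and distance of the pair that differs the most.
    // Returns {indexA, indexB, maxDistance}.
    static int[] maxPair(List<byte[]> arrays, ToIntBiFunction<byte[], byte[]> metric) {
        int bestI = -1, bestJ = -1, bestD = -1;
        for (int i = 0; i < arrays.size(); i++) {
            for (int j = i + 1; j < arrays.size(); j++) {
                int d = metric.applyAsInt(arrays.get(i), arrays.get(j));
                if (d > bestD) {
                    bestD = d;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[]{ bestI, bestJ, bestD };
    }
}
```

With up to 1024 arrays that is at most ~524k metric evaluations, so the per-pair cost of the metric (quadratic for Levenshtein) dominates, which is why trimming shared content pays off so much.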
Upvotes: 1