Reputation: 78
This is a rather abstract question, as I have no idea yet how to solve it and haven't found any suitable solutions.
Let's start with the current situation. You have an array of byte[] arrays (e.g. an ArrayList<byte[]>) which behind the scenes are actually Strings, but at the current stage the byte[] form is preferred. They can be very long (1024+ bytes per byte[] array, and the ArrayList may contain up to 1024 such arrays) and may have different lengths. Furthermore, they share a lot of the same bytes at the "same" locations (relatively speaking: for a = {0x41, 0x41, 0x61} and b = {0x41, 0x41, 0x42, 0x61}, the leading 0x41 bytes and the trailing 0x61 are the same).
I'm looking now for an algorithm that compares all those arrays with each other. The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
If possible, this should work without any third-party libraries (but I doubt it is feasible in a reasonable time without one).
Any suggestions are very welcome.
Edit:
Made some adjustments.
EDIT / SOLUTION:
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime/speed. This is very specific to the data I'm handling, as I know that all the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input/answers, they were a great help.
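For illustration, here is a minimal Java sketch of this idea: strip the shared prefix and suffix first, then run a standard two-row Levenshtein DP on what remains. This is my own sketch of the "filtering" approach described above, not the poster's actual code; the class and method names are made up.

```java
public class TrimmedLevenshtein {

    // Strip the shared prefix and suffix before running the O(m*n) DP.
    // When two arrays share most of their content, only the differing
    // middle section is actually compared.
    static int distance(byte[] a, byte[] b) {
        int start = 0;
        int endA = a.length, endB = b.length;
        // skip the common prefix
        while (start < endA && start < endB && a[start] == b[start]) start++;
        // skip the common suffix (without crossing into the prefix)
        while (endA > start && endB > start && a[endA - 1] == b[endB - 1]) {
            endA--;
            endB--;
        }
        return levenshtein(a, start, endA, b, start, endB);
    }

    // Classic two-row dynamic-programming Levenshtein on the sub-ranges.
    static int levenshtein(byte[] a, int aFrom, int aTo, byte[] b, int bFrom, int bTo) {
        int m = aTo - aFrom, n = bTo - bFrom;
        if (m == 0) return n;
        if (n == 0) return m;
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = (a[aFrom + i - 1] == b[bFrom + j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[n];
    }

    public static void main(String[] args) {
        byte[] a = {0x41, 0x41, 0x61};
        byte[] b = {0x41, 0x41, 0x42, 0x61};
        // one insertion (0x42) after trimming prefix/suffix -> prints 1
        System.out.println(distance(a, b));
    }
}
```

For the example pair from the question, the trimming reduces the DP to comparing an empty range against a single byte, so the quadratic part never runs on the shared content.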
Upvotes: 0
Views: 184
Reputation: 78
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime/speed. This is very specific to the data I'm handling, as I know that all the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input/answers, they were a great help.
Upvotes: 0
Reputation: 8345
The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
You will not be able to find a solution where the metric and the running time are independent; they go hand in hand.
For example: if your metric is like the example from your post, i.e. d(str1, str2) = d(str1.first, str2.first) + d(str1.last, str2.last), then the solution is very easy: sort your array by first and last character (maybe separately), and then take the first and last elements of the sorted array. That gives you O(n log n) for the sort.
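A sketch of that sorting trick, under the toy first/last-byte metric (all names here are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExtremesByMetric {
    // Toy metric: distance depends only on the first and last byte
    // of each array. Note it is blind to any differences in between.
    static int d(byte[] x, byte[] y) {
        return Math.abs(x[0] - y[0])
             + Math.abs(x[x.length - 1] - y[y.length - 1]);
    }

    // Sorting by first byte places the pair with the largest first-byte
    // gap at the two ends of the list: O(n log n) instead of comparing
    // all O(n^2) pairs (for this component of the metric).
    static byte[][] extremesByFirstByte(List<byte[]> arrays) {
        List<byte[]> sorted = new ArrayList<>(arrays);
        sorted.sort(Comparator.comparingInt((byte[] a) -> a[0]));
        return new byte[][]{ sorted.get(0), sorted.get(sorted.size() - 1) };
    }
}
```

The same sort could be repeated on the last byte; the point is only that a metric with this structure lets you avoid pairwise comparison entirely.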
But if your metric is something like "two sentences are close if they contain many equal words", then this does not work at all, and you end up with O(n²). Or you may be able to come up with a nifty way to re-order the words within the sentences before sorting the sentences, etc.
So unless you have a known metric with exploitable structure, it's O(n²) with the trivial (naive) implementation: compare every pair while keeping track of the maximum delta.
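That naive baseline might look like the following sketch, with the metric passed in as a function so any distance (e.g. Levenshtein) can be plugged in; the names are mine, not from the answer:

```java
import java.util.List;
import java.util.function.ToIntBiFunction;

public class MostDifferent {
    // Naive O(n^2) scan: evaluate the metric on every pair and remember
    // the indices and distance of the pair that differs the most.
    // Returns {indexA, indexB, maxDistance}.
    static int[] maxPair(List<byte[]> arrays, ToIntBiFunction<byte[], byte[]> metric) {
        int bestI = -1, bestJ = -1, bestD = -1;
        for (int i = 0; i < arrays.size(); i++) {
            for (int j = i + 1; j < arrays.size(); j++) {
                int d = metric.applyAsInt(arrays.get(i), arrays.get(j));
                if (d > bestD) {
                    bestD = d;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[]{ bestI, bestJ, bestD };
    }
}
```

With up to 1024 arrays that is at most ~524k metric evaluations, so the per-pair cost of the metric (quadratic for Levenshtein) dominates, which is why trimming shared content pays off so much.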
Upvotes: 1