Reputation: 183
I have millions of fixed-size (100) int arrays. Each array is sorted and has unique elements. For each array, I want to find all arrays which have 70% common elements. Right now I am getting around 1 million comparisons (using Arrays.binarySearch()) per second, which is too slow for us.
Can anyone recommend a better searching algorithm?
Upvotes: 12
Views: 3531
Reputation: 533492
You could try a merge sort ignoring duplicates. This is an O(n) operation for sorted arrays. If the two arrays have 70% elements in common the resulting collection will have 130 or less unique ints. In your case you don't need the result so you can just count the number of unique entries and stop as soon as you reach 131 or the end of both arrays.
EDIT (2) The following code can do ~176 billion comparisions in about 47 seconds using 4 cores. Making the code multi-threaded with 4 cours was only 70% faster.
Using BitSet only works if the range of int values is fairly small. Otherwise you have to compare the int[] (I have left the code in should you need it)
Peformed 176,467,034,428 comparisons in 47.712 seconds and found 444,888 matches
public static void main(String... args) throws InterruptedException {
int length = 100;
int[][] ints = generateArrays(50000, length);
final BitSet[] bitSets = new BitSet[ints.length];
for(int i=0;i<ints.length;i++) {
int[] ia = ints[i];
BitSet bs = new BitSet(ia[ia.length-1]);
for (int i1 : ia)
bs.set(i1);
bitSets[i] = bs;
}
final AtomicInteger matches = new AtomicInteger();
final AtomicLong comparisons = new AtomicLong();
int nThreads = Runtime.getRuntime().availableProcessors();
ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
long start = System.nanoTime();
for (int i = 0; i < bitSets.length - 1; i++) {
final int finalI = i;
executorService.submit(new Runnable() {
public void run() {
for (int j = finalI + 1; j < bitSets.length; j++) {
int compare = compare(bitSets[finalI], bitSets[j]);
if (compare <= 130)
matches.incrementAndGet();
comparisons.addAndGet(compare);
}
}
});
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.HOURS);
long time = System.nanoTime() - start;
System.out.printf("Peformed %,d comparisons in %.3f seconds and found %,d matches %n",comparisons.longValue(),time/1e9, matches.intValue());
}
private static int[][] generateArrays(int count, int length) {
List<Integer> rawValues = new ArrayList<Integer>(170);
for (int i = 0; i < 170; i++)
rawValues.add(i);
int[][] ints = new int[count][length];
Random rand = new Random(1);
for (int[] ia : ints) {
Collections.shuffle(rawValues, rand);
for (int i = 0; i < ia.length; i++)
ia[i] = (int) (int) rawValues.get(i);
Arrays.sort(ia);
}
return ints;
}
private static int compare(int[] ia, int[] ja) {
int count = 0;
int i=0,j=0;
while(i<ia.length && j<ja.length) {
int iv = ia[i];
int jv = ja[j];
if (iv < jv) {
i++;
} else if (iv > jv) {
j++;
} else {
count++; // duplicate
i++;
j++;
}
}
return ia.length + ja.length - count;
}
private static int compare(BitSet ia, BitSet ja) {
BitSet both = new BitSet(Math.max(ia.length(), ja.length()));
both.or(ia);
both.or(ja);
return both.cardinality();
}
Upvotes: 3
Reputation: 298838
Something like this should do the job (provided that the arrays are sorted and contain unique elements):
public static boolean atLeastNMatchingElements(final int n,
final int[] arr1,
final int[] arr2){
/* check assumptions */
assert (arr1.length == arr2.length);
final int arrLength = arr2.length;
{ /* optimization */
final int maxOffset = Math.max(arrLength - n, 0);
if(arr1[maxOffset] < arr2[0] || arr2[maxOffset] < arr1[0]){
return false;
}
}
int arr2Offset = 0;
int matches = 0;
/* declare variables only once, outside loop */
int compResult; int candidate;
for(int i = 0; i < arrLength; i++){
candidate = arr1[i];
while(arr2Offset < arrLength - 1){
compResult = arr2[arr2Offset] - candidate;
if(compResult > 0){
break;
} else{
arr2Offset++;
if(compResult == 0){
matches++;
break;
}
}
}
if(matches == n){
return true;
}
/* optimization */
else if(matches < n - (arrLength - arr2Offset)){
return false;
}
}
return false;
}
Sample usage:
public static void main(final String[] args){
final int[] arr1 = new int[100];
final int[] arr2 = new int[100];
int x = 0, y = 0;
for(int i = 0; i < 100; i++){
if(i % 10 == 0){
x++;
}
if(i % 12 == 0){
y++;
}
arr1[i] = x;
arr2[i] = y;
x++;
y++;
}
System.out.println(atLeastNMatchingElements(70, arr1, arr2));
System.out.println(atLeastNMatchingElements(95, arr1, arr2));
}
Output:
true
false
I have now tried to optimize the above code. Please check whether the code blocks marked as
/* optimization */
make a noticeable difference. After optimization, I would refactor the code to get it down to one or two return statements.
Upvotes: 3
Reputation: 49724
There are two quick optimisations you can make.
If array A's start element is greater than B's end element, they trivially can't have common elements.
They other one is a triangle inequality-like thing:
f(B,C) <= 100 - |f(A,B)-f(A,C)|
The reason for this is that (assuming f(A,B) > f(A,C)
) there are at least f(A,B) - f(A,C)
elements that are in both A and B but not in C. Which means that they can't be common elements of B and C.
Upvotes: 3