Richard

Reputation: 6116

Java Searching through two Arrays

I have two ArrayLists. List A has 8.1k elements and list B has 81k elements.

I need to iterate through B, search for each element's match in A, and then change a field in the matched element of list B.

Here's my code:

private void mapAtoB(List<A> aList, ListIterator<B> it) {
    AtomicInteger i = new AtomicInteger(-1);
    while(it.hasNext()) {
        System.out.print(i.incrementAndGet() + ", ");
        B b = it.next();
        aList.stream().filter(a -> b.equalsB(a)).forEach(a -> {
            b.setId(String.valueOf(a.getRedirectId()));
            it.set(b);
        });
    }
    System.out.println();
}

public class B {
    public boolean equalsB(A a) {
        if (a == null) return false;

        if (this.getFullURL().contains(a.getFirstName())) return true;

        return false;
    }
}

But this is taking forever: close to 15 minutes for a single run of this method. Is there any way to optimize any of this? A 15-minute run time is way too much.

Upvotes: 0

Views: 594

Answers (1)

starikoff

Reputation: 1650

I'll be happy to see a good and thorough solution, meanwhile I can propose two ideas (or maybe two reincarnations of one).

The first one is to speed up the search for all objects of type A within one object of type B. For that, the Rabin-Karp algorithm seems applicable and simple enough to implement quickly; Aho-Corasick is harder to implement but will probably give better results (I'm not sure how much better).
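For illustration, here is a minimal single-pattern Rabin-Karp substring check in Java. This is my own sketch of the algorithm mentioned above (the class name, base, and modulus are my choices, not from the question); a multi-pattern variant would hash a fixed-length prefix of every firstName and slide one window over each fullUrl.

```java
import java.util.*;

public class RabinKarp {
    private static final long BASE = 256;
    private static final long MOD = 1_000_000_007L;

    /** Returns true if pattern occurs in text, using a rolling hash. */
    static boolean contains(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return true;
        if (m > n) return false;

        long patHash = 0, winHash = 0, pow = 1; // pow = BASE^(m-1) mod MOD
        for (int i = 0; i < m; i++) {
            patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
            winHash = (winHash * BASE + text.charAt(i)) % MOD;
            if (i < m - 1) pow = (pow * BASE) % MOD;
        }
        for (int i = 0; ; i++) {
            // On a hash match, verify the characters to rule out collisions.
            if (patHash == winHash && text.regionMatches(i, pattern, 0, m)) return true;
            if (i + m >= n) return false;
            // Roll the window: drop text[i], append text[i + m].
            winHash = ((winHash - text.charAt(i) * pow % MOD + MOD) * BASE
                       + text.charAt(i + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(contains("http://example.com/users/alice", "alice")); // true
        System.out.println(contains("http://example.com/users/alice", "bob"));   // false
    }
}
```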

The other option is to limit the number of objects of type B which must be fully processed for each object of A. For that you could, e.g., build an inverted N-gram index: for each fullUrl you take all its substrings of length N (its "N-grams"), and you build a map from each such N-gram to the set of B's that have that N-gram in their fullUrl. When searching for an object A, you take all of its N-grams, find the set of B's for each such N-gram, and intersect all these sets; the intersection will contain all B's that you should fully process.

I implemented this approach quickly; for the sizes you specified it gives a 6-7× speedup with N=4. As N grows, search becomes faster, but building the index slows down (so if you can reuse the index you are probably better off choosing a bigger N). The index takes about 200 MB for the sizes you specified, so this approach will only scale so far with the growth of the collection of B's.

Assuming that all strings are longer than NGRAM_LENGTH, here's the quick and dirty code for building the index using Guava's SetMultimap (specifically HashMultimap):

    SetMultimap<String, B> idx = HashMultimap.create();
    for (B b : bList) {
        for (int i = 0; i < b.getFullURL().length() - NGRAM_LENGTH + 1; i++) {
            idx.put(b.getFullURL().substring(i, i + NGRAM_LENGTH), b);
        }
    }

And for the search:

private void mapAtoB(List<A> aList, SetMultimap<String, B> mmap) {
    for (A a : aList) {
        Set<B> possible = null;
        for (int i = 0; i < a.getFirstName().length() - NGRAM_LENGTH + 1; i++) {
            String ngram = a.getFirstName().substring(i, i + NGRAM_LENGTH);
            Set<B> forNgram = mmap.get(ngram);
            if (possible == null) {
                // copy: retainAll below mutates the set; a HashSet keeps retainAll fast
                possible = new HashSet<>(forNgram);
            } else {
                possible.retainAll(forNgram);
            }
            if (possible.size() < 20) { // few enough candidates: it's ok to scan through them
                break;
            }
        }
        if (possible == null) continue; // firstName shorter than NGRAM_LENGTH
        for (B b : possible) {
            if (b.equalsB(a)) {
                b.setId(String.valueOf(a.getRedirectId()));
            }
        }
    }
}
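To show the index-and-intersect idea end to end without the Guava dependency, here is a self-contained sketch using only JDK collections. The minimal A and B classes are hypothetical stand-ins for the originals (only the fields the question mentions), and NGRAM_LENGTH=4 matches the N=4 measurement above:

```java
import java.util.*;

public class NgramDemo {
    static final int NGRAM_LENGTH = 4;

    // Hypothetical minimal stand-ins for the question's A and B classes.
    static class A { String firstName, redirectId;
        A(String f, String r) { firstName = f; redirectId = r; } }
    static class B { String fullURL, id;
        B(String u) { fullURL = u; } }

    // Plain-JDK equivalent of the Guava SetMultimap index: n-gram -> set of B's.
    static Map<String, Set<B>> buildIndex(List<B> bList) {
        Map<String, Set<B>> idx = new HashMap<>();
        for (B b : bList)
            for (int i = 0; i + NGRAM_LENGTH <= b.fullURL.length(); i++)
                idx.computeIfAbsent(b.fullURL.substring(i, i + NGRAM_LENGTH),
                                    k -> new HashSet<>()).add(b);
        return idx;
    }

    static void mapAtoB(List<A> aList, Map<String, Set<B>> idx) {
        for (A a : aList) {
            Set<B> possible = null;
            for (int i = 0; i + NGRAM_LENGTH <= a.firstName.length(); i++) {
                Set<B> forNgram = idx.getOrDefault(
                        a.firstName.substring(i, i + NGRAM_LENGTH),
                        Collections.emptySet());
                if (possible == null) possible = new HashSet<>(forNgram);
                else possible.retainAll(forNgram);
                if (possible.isEmpty()) break; // no candidate can survive
            }
            if (possible == null) continue; // firstName shorter than NGRAM_LENGTH
            for (B b : possible)            // verify candidates with the real check
                if (b.fullURL.contains(a.firstName)) b.id = a.redirectId;
        }
    }

    public static void main(String[] args) {
        List<B> bs = Arrays.asList(new B("http://site/u/alice"), new B("http://site/u/bob"));
        mapAtoB(Arrays.asList(new A("alice", "42")), buildIndex(bs));
        System.out.println(bs.get(0).id); // 42
        System.out.println(bs.get(1).id); // null
    }
}
```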

A possible direction for further optimization would be to use hashes of the N-grams instead of the full N-grams as keys, reducing the memory footprint and the need for string key comparisons.
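That direction could be sketched like this (my own untested variant, keyed on int hash codes and on list indices rather than B objects to keep it self-contained). Two different N-grams may share a hash code; that only adds false candidates, which the final verification pass would filter out anyway:

```java
import java.util.*;

public class HashedNgramIndex {
    static final int N = 4; // assumed n-gram length

    // Map from an n-gram's hash code to the indices of strings containing it.
    static Map<Integer, Set<Integer>> build(List<String> urls) {
        Map<Integer, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < urls.size(); id++) {
            String url = urls.get(id);
            for (int i = 0; i + N <= url.length(); i++)
                idx.computeIfAbsent(url.substring(i, i + N).hashCode(),
                                    k -> new HashSet<>()).add(id);
        }
        return idx;
    }

    // Indices of strings that MAY contain 'name'; callers must still verify
    // with a real contains() check because of possible hash collisions.
    static Set<Integer> candidates(Map<Integer, Set<Integer>> idx, String name) {
        Set<Integer> result = null;
        for (int i = 0; i + N <= name.length(); i++) {
            Set<Integer> bucket = idx.getOrDefault(
                    name.substring(i, i + N).hashCode(), Collections.emptySet());
            if (result == null) result = new HashSet<>(bucket);
            else result.retainAll(bucket);
            if (result.isEmpty()) break;
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

Storing boxed Integer keys still has overhead; a primitive-keyed map (e.g. from fastutil or Eclipse Collections) would shrink it further.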

Upvotes: 1
