DGK
DGK

Reputation: 3015

Comparing 2 very large arraylists in java

What would be the correct approach when you need to compare 2 very large arraylists with each other?

These arraylist are both 100,000 items in size and will definitely crash when simply comparing item per item.

for (CItem c : cItems) {
        for (CItem r : rItems) {
            if (c.getID().equals(r.getID())) {
                Mismatch m = compareItems(c, r);
                if (m != null) {
                    mismatches.add(m);
                }
            }
        }
    }

Now I'm not 100% sure how the garbage collection works in this situation but the errors we get are:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664) ~[na:1.8.0_73]
at java.lang.String.<init>(String.java:207) ~[na:1.8.0_73]
at java.lang.StringBuilder.toString(StringBuilder.java:407) ~[na:1.8.0_73]

and

java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181) ~[na:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[na:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[na:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[na:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[na:1.8.0_73]

So far the possible solutions are

Any input on this matter would be appreciated.

Upvotes: 1

Views: 3511

Answers (5)

rholdberh
rholdberh

Reputation: 557

I have the same problem. So I tried to use LinkedList. So I have 2 Linkedlist that can contain up to 3.5millions of string records. And then I am running

LinkedList diflist= (LinkedList) ListUtils.subtract(sourceList, targetList);

to get the difference,but my application stacks on this.

So are there any good algorithms to compare lists?

Upvotes: 0

Ridcully
Ridcully

Reputation: 23655

If the IDs in any item-list are unique, you can use a Map for your rItems with the ID as key.

Map<Long, CItem> rItemMap = new HashMap<>(rItems.size());
for (CItem r : rItems) {
    rItemMap.put(r.getID(), r);
}

Now you can check directly for rItems with same ID:

for (CItem c : cItems) {
    CItem r = rItemMap.get(c.getID());
    if (r != null) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
    }
}

Even if the IDs are not unique, you could still work with a Map, you just would have a List of all items with that ID as the value of one Map.Entry and you'd only have to iterate over those few items instead of iterating over the whole list.

Edit regarding OutOfMemory

I just saw from your Exception, that you're using ArrayList. Using LinkedList instead might help, because the ArrayList is based on a (fixed size) array and when that array is filled up, a new - larger - array is allocated and the data from the old array is copied to the new array and then freed.

So if you have an array of size 1000 and it is full, a new array of e.g. size 2000 is allocated. At that moment, memory for 3000 items is required (although the 1000 are freed shortly after).

A LinkedList just allocates memory for every item you add to it (plus memory to point to the next and previous element).

Upvotes: 3

Mincong Huang
Mincong Huang

Reputation: 5552

Sort the 2 lists and then compare them in order. Sorting costs O(n log n) and compare costs O(n).

Comparator<CItem> idComparator = new Comparator<CItem>() {
    @Override
    public int compare(CItem i1, CItem i2) {
        // Implementation depends on the type of CItem ID:
        // if ID is an integer or double, maybe you need
        // return i1.getID() - i2.getID();
        return i1.getID().compareTo(i2.getID());
    }
});

Collections.sort(cItems, idComparator);
Collections.sort(rItems, idComparator);

int minLen = Math.min(cItems.size(), rItems.size());
for (int i = 0, j = 0; i < minLen && j < minLen; ) {

    CItem c = cItems.get(i);
    CItem r = rItems.get(j);

    // c.getID().equals(r.getID())
    if (idComparator.compare(c, r) == 0) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
        i++;
        j++;

    // item c's ID does not exist in list rItems
    } else if (idComparator.compare(c, r) < 0) {
        i++;

    // item r's ID does not exist in list cItems
    } else {
        j++;
    }
}

Upvotes: 1

It looks you want to see if 2 objects with the same ID are the same when compared an other way.

Probable Problem here is that you check 100.000 x 100.000 objects against each other. To make it worse, you just add the ones to a new list...

Option 1) You did not tell how you created the ArrayList()s. If you get the objects from a Database you might just query that. (those are good at that, even if you aren't)

Option 2) Add the 2 ArrayList()s together, they seem to be the same kind of objects. Make the objects sortable (maybe by ID), sort the single list. (creates an other problem) Then use a loop to compare the now sorted Objects to their neighbour.

Upvotes: 3

You can use the method removeAll in the collection interface :)

rItems.removeAll(cItems);

if you look inside of the implementtion, the method compares using equals aswell...

This approach would let you obtain the items from each list which do not match to the other one.

Upvotes: 1

Related Questions