Reputation: 3015
What would be the correct approach when you need to compare 2 very large arraylists with each other?
These arraylist are both 100,000 items in size and will definitely crash when simply comparing item per item.
for (CItem c : cItems) {
for (CItem r : rItems) {
if (c.getID().equals(r.getID())) {
Mismatch m = compareItems(c, r);
if (m != null) {
mismatches.add(m);
}
}
}
}
Now I'm not 100% sure how the garbage collection works in this situation but the errors we get are:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664) ~[na:1.8.0_73]
at java.lang.String.<init>(String.java:207) ~[na:1.8.0_73]
at java.lang.StringBuilder.toString(StringBuilder.java:407) ~[na:1.8.0_73]
and
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181) ~[na:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[na:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[na:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[na:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[na:1.8.0_73]
So far the possible solutions are
Any input on this matter would be appreciated.
Upvotes: 1
Views: 3511
Reputation: 557
I have the same problem. So I tried to use LinkedList. So I have 2 Linkedlist that can contain up to 3.5millions of string records. And then I am running
LinkedList diflist= (LinkedList) ListUtils.subtract(sourceList, targetList);
to get the difference,but my application stacks on this.
So are there any good algorithms to compare lists?
Upvotes: 0
Reputation: 23655
If the IDs in any item-list are unique, you can use a Map
for your rItems
with the ID
as key.
Map<Long, CItem> rItemMap = new HashMap<>(rItems.size());
for (CItem r : rItems) {
rItemMap.put(r.getID(), r);
}
Now you can check directly for rItems with same ID:
for (CItem c : cItems) {
CItem r = rItemMap.get(c.getID());
if (r != null) {
Mismatch m = compareItems(c, r);
if (m != null) {
mismatches.add(m);
}
}
}
Even if the IDs are not unique, you could still work with a Map, you just would have a List of all items with that ID as the value of one Map.Entry and you'd only have to iterate over those few items instead of iterating over the whole list.
Edit regarding OutOfMemory
I just saw from your Exception, that you're using ArrayList
. Using LinkedList
instead might help, because the ArrayList is based on a (fixed size) array and when that array is filled up, a new - larger - array is allocated and the data from the old array is copied to the new array and then freed.
So if you have an array of size 1000 and it is full, a new array of e.g. size 2000 is allocated. At that moment, memory for 3000 items is required (although the 1000 are freed shortly after).
A LinkedList just allocates memory for every item you add to it (plus memory to point to the next and previous element).
Upvotes: 3
Reputation: 5552
Sort the 2 lists and then compare them in order. Sorting costs O(n log n)
and compare costs O(n)
.
Comparator<CItem> idComparator = new Comparator<CItem>() {
@Override
public int compare(CItem i1, CItem i2) {
// Implementation depends on the type of CItem ID:
// if ID is an integer or double, maybe you need
// return i1.getID() - i2.getID();
return i1.getID().compareTo(i2.getID());
}
});
Collections.sort(cItems, idComparator);
Collections.sort(rItems, idComparator);
int minLen = Math.min(cItems.size(), rItems.size());
for (int i = 0, j = 0; i < minLen && j < minLen; ) {
CItem c = cItems.get(i);
CItem r = rItems.get(j);
// c.getID().equals(r.getID())
if (idComparator.compare(c, r) == 0) {
Mismatch m = compareItems(c, r);
if (m != null) {
mismatches.add(m);
}
i++;
j++;
// item c's ID does not exist in list rItems
} else if (idComparator.compare(c, r) < 0) {
i++;
// item r's ID does not exist in list cItems
} else {
j++;
}
}
Upvotes: 1
Reputation: 2189
It looks you want to see if 2 objects with the same ID are the same when compared an other way.
Probable Problem here is that you check 100.000 x 100.000 objects against each other. To make it worse, you just add the ones to a new list...
Option 1) You did not tell how you created the ArrayList()s. If you get the objects from a Database you might just query that. (those are good at that, even if you aren't)
Option 2) Add the 2 ArrayList()s together, they seem to be the same kind of objects. Make the objects sortable (maybe by ID), sort the single list. (creates an other problem) Then use a loop to compare the now sorted Objects to their neighbour.
Upvotes: 3
Reputation: 48258
You can use the method removeAll in the collection interface :)
rItems.removeAll(cItems);
if you look inside of the implementtion, the method compares using equals aswell...
This approach would let you obtain the items from each list which do not match to the other one.
Upvotes: 1