Daud

Reputation: 7877

Which is more efficient: using removeAll() or using the following HashMap technique to retain only changed records in an ArrayList?

I have two ArrayLists A and B of the same data structure C (hashCode() and equals() overridden). C represents a student's record. The two lists are of the same size and represent new student records and old ones respectively (the students are the same in both lists; only the ordering might differ). I wish to keep only those records in A that have been changed. As such, I do:

 A.removeAll(B)

As per the javadocs, this takes each record of A and compares it with each record of B, and if it finds both equal, it removes the record from A. If a record of A is not equal to any record in B, then, since all students in A are also in B, that record of A must have changed. The problem is that this is easily O(n²).

Another approach can be :

Map<Integer, C> map = new HashMap<Integer, C>(); // assuming getStudentId() returns an Integer
for (C record : B){
    map.put(record.getStudentId(), record);
}
List<C> changedRecords = new ArrayList<C>();
for (C record : A){
    if (!record.equals(map.get(record.getStudentId()))) {
        changedRecords.add(record);
    }
}

I think this might be of lower complexity than the above solution. Is that correct?

Upvotes: 14

Views: 18605

Answers (4)

Yann TM

Reputation: 2075

I have encountered a performance bottleneck in removeAll in some instances (EMF model manipulation related). For an ArrayList, as mentioned above, just use the standard removeAll, but if A is, for instance, an EList, O(n²) behavior can be encountered.

Hence, avoid relying on hidden good properties of specific implementations of List<T>; Set.contains() being O(1) is a guarantee if you use a HashSet and have a decent hashCode (O(log n) for a TreeSet with an ordering relation), so use that to bound algorithmic complexity.

I use the following code, which avoids useless copies; the intention is that you are scanning a data structure, finding irrelevant elements you don't want, and adding them to "todel".

For some reason (avoiding concurrent modification, navigating a tree, etc.) you cannot remove elements during this traversal, so we accumulate them into a HashSet "todel".

In the function, we need to modify "container" in place, since it is typically an attribute of the caller, but using remove(int index) on "container" might induce a copy because of the left shift of elements. We use a copy "contents" to achieve this.

The bounded wildcard is there because during the selection process I often get subtypes of C, but feel free to use <T> everywhere.

/**
 * Efficient O(n) operation to removeAll from an aggregation.
 * @param container a container for a set of elements (no duplicates), some of which we want to get rid of
 * @param todel some elements to remove, typically stored in a HashSet.
 */
public static <T> void removeAll ( List<T> container, Set<? extends T> todel ) {
    if (todel.isEmpty())
        return;
    List<T> contents = new ArrayList<T>(container);
    container.clear();
    // since container has no duplicates, we can stop calling contains()
    // once all elements of todel have been encountered
    int torem = todel.size();
    for (T elt : contents) {
        if ( torem == 0 || ! todel.contains(elt) ) {
            container.add(elt);
        } else {
            torem--;
        }
    }
}

So in your case you would invoke it as removeAll(A, new HashSet<C>(B)), paying for one copy of B if you really cannot accumulate into a Set<C> during the selection phase.

Place it in a utility class and static import for ease of use.
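For example, a call-site sketch (the utility class util.Lists and its static import are made up for illustration):

import static util.Lists.removeAll; // hypothetical utility class holding the method above

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class Demo {
    public static void main(String[] args) {
        List<String> a = new ArrayList<String>(Arrays.asList("ann", "bob", "cat"));
        List<String> b = Arrays.asList("bob");
        // One O(|b|) copy of b into a HashSet buys O(1) contains() lookups:
        removeAll(a, new HashSet<String>(b));
        System.out.println(a); // prints [ann, cat]
    }
}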

Upvotes: 1

aioobe

Reputation: 420951

Yes, the latter algorithm is better than O(n²). Since you have two loops, one ranging over B and another over A, and you do (amortized) constant work in each iteration, your new solution runs in O(|A| + |B|).

I suspect that you don't have any duplicate entries though. If this is the case, you could also go via a HashSet (change to LinkedHashSet if you want to preserve the order in A):

HashSet<C> tmp = new HashSet<C>(A);
tmp.removeAll(B);                     // Linear operation (but see the caveat below)
A = new ArrayList<C>(tmp);

(Or if order doesn't matter to you, you could use HashSets all the way through.)
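For instance, a quick sketch of the all-sets variant (assuming no duplicates and that ordering is irrelevant):

Set<C> changed = new HashSet<C>(A);     // O(|A|)
changed.removeAll(new HashSet<C>(B));   // linear given O(1) lookups on both sides
// 'changed' now holds only the records of A that differ from B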


As pointed out by @Daud in the comments below, HashSet.removeAll(Collection c) actually calls c.contains repeatedly if the hash set is not larger than the collection, which affects the complexity (at least in OpenJDK). This is because the implementation always chooses to iterate over the smaller collection.
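For reference, a simplified model of the OpenJDK logic in question (based on AbstractSet.removeAll; a sketch, not the verbatim JDK source, with java.util imports assumed):

// Simplified model of AbstractSet.removeAll's size-based branch.
static <T> boolean removeAllModel(Set<T> set, Collection<?> c) {
    boolean modified = false;
    if (set.size() > c.size()) {
        // Iterate the smaller argument; each set.remove() is O(1) for a HashSet.
        for (Object e : c)
            modified |= set.remove(e);
    } else {
        // Iterate the set and call c.contains() on each element; when c is
        // an ArrayList, each contains() is O(|c|) -- the pitfall noted above.
        for (Iterator<T> it = set.iterator(); it.hasNext(); ) {
            if (c.contains(it.next())) {
                it.remove();
                modified = true;
            }
        }
    }
    return modified;
}

Wrapping B in a HashSet before the call, as in tmp.removeAll(new HashSet<C>(B)), keeps both branches linear.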

Upvotes: 11

Anish Dasappan

Reputation: 415

The second 'algorithm' is definitely better than the first under amortized analysis. But is it the best way? Do you need it? Will it cause any visible impact to the user in terms of performance? Does the number of items in the list grow so huge that this becomes a bottleneck in the system?

The first approach is more readable and conveys your intention to the people who maintain the code. It is also preferable to use a 'tested' API instead of re-inventing the wheel (unless absolutely necessary). Computers have become so fast that we shouldn't do any premature optimization.

If it seems essential, I might go with a solution using a Set, similar to @aioobe's.

Upvotes: 1

TechTrip

Reputation: 4537

What you may save on complexity you could be losing in memory allocation, so this isn't necessarily more efficient. ArrayList uses something similar to an in-place partitioning algorithm to run down the backing array and test each element against the incoming collection.

When comparing, it simply asks the incoming collection whether it contains the value at each index of the backing Object[] array. The algorithm maintains two indexes: one to iterate through the backing array and one as a write position for elements that are kept. In the case of a match it simply advances the read index on the backing array and carries on to the next element; this is relatively cheap.

When it finds that the incoming collection doesn't contain the value at the current index in the backing array, it simply overwrites the slot at the write position with the element at the current index, without incurring a new memory allocation. This pattern repeats until all elements in the ArrayList have been compared against the incoming collection, hence the complexity you are concerned about.
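A simplified sketch of that compaction loop (modeled on OpenJDK's ArrayList.batchRemove; names and details are illustrative, not the exact JDK source):

// Simplified model of the in-place compaction described above; the real
// implementation also handles exceptions and nulls (see the note below).
static <T> int batchRemoveModel(T[] elementData, int size, Collection<?> c) {
    int w = 0; // write index: next slot for an element that is kept
    for (int r = 0; r < size; r++) {
        // Elements found in c are skipped; the rest are shifted left in place,
        // so no new backing array is ever allocated.
        if (!c.contains(elementData[r]))
            elementData[w++] = elementData[r];
    }
    // Null out the tail so the removed slots can be garbage collected.
    for (int i = w; i < size; i++)
        elementData[i] = null;
    return w; // the new logical size
}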

For example: consider an ArrayList A with 1,2,4,5 and a collection C with 4,1 that we match against, wanting to remove 4 and 1. Here is each iteration of the loop, where r is the for-loop index on ArrayList A (for (; r < size; r++)) and goes from 0 to 4:

r = 0 (Does C contain 1? Yes, skip to the next one) A: 1,2,4,5, w = 0

r = 1 (Does C contain 2? No, copy the value at r into the spot pointed to by w++) A: 2,2,4,5, w = 1

r = 2 (Does C contain 4? Yes, skip) A: 2,2,4,5, w = 1

r = 3 (Does C contain 5? No, copy the value at r into the spot pointed to by w++) A: 2,5,4,5, w = 2

r = 4, stop

Compare w to the size of the backing array, which is 4. Since they are not equal, null out the values from w to the end of the array and reset the size.

A: 2,5, size of 2

The built-in removeAll also considers that ArrayLists can contain null; you could throw an NPE on record.getStudentId() in your solution above. Finally, removeAll protects against exceptions thrown from Collection.contains during the compare: if that happens, it uses a finally block to do a native memory copy, which protects the backing array from corruption in a highly efficient manner.

Upvotes: 1
