Reputation: 24499

Java Collection performance question

I have created a method that takes two Collection<String> as input and copies one to the other.

However, I am not sure if I should check if the collections contain the same elements before I start copying, or if I should just copy regardless. This is the method:

 /**
  * Copies from one collection to the other. Does not allow empty string. 
  * Removes duplicates.
  * Clears the dest Collection first
  * @param src
  * @param dest
  */
 public static void copyStringCollectionAndRemoveDuplicates(Collection<String> src, Collection<String> dest) {
  if(src == null || dest == null)
   return;

  //Is this faster to do? Or should I just comment this block out
  if(src.containsAll(dest))
   return;

  dest.clear();
  Set<String> uniqueSet = new LinkedHashSet<String>(src.size());
  for(String f : src) 
   if(!"".equals(f)) 
    uniqueSet.add(f);

  dest.addAll(uniqueSet);
 }

Maybe it is faster to just remove the

if(src.containsAll(dest))
    return;

Because this method will iterate over the entire collection anyway.

Upvotes: 5

Answers (6)

Julien Rentrop

Reputation: 651

The code is hard to read and is not very efficient. The "dest" parameter is confusing: It's passed as a parameter, then it's cleared and the results are added to it. What's the point of it being a parameter? Why not simply return a new collection? The only benefit I can see is that the caller can determine the collection type. Is that necessary?

I think this code can be more clearly and probably more efficiently written as follows:

public static Set<String> createSet(Collection<String> source) {
    Set<String> destination = new HashSet<String>(source) {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean add(String o) {
            if ("".equals(o)) {
                return false;
            }
            return super.add(o);
        }
    }; 
    return destination;
}

Another way is to create your own set type:

public class NonEmptyStringSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;

    public NonEmptyStringSet() {
        super();
    }

    public NonEmptyStringSet(Collection<String> source) {
        super(source);
    }

    @Override
    public boolean add(String o) {
        if ("".equals(o)) {
            return false;
        }
        return super.add(o);
    }
}

Usage:

createSet(source);
new NonEmptyStringSet(source);

Returning the set is more performant because you don't first have to create a temporary set and then add all to the dest collection.

The benefit of the NonEmptyStringSet type is that you can keep adding strings and still have the empty string check.
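A minimal sketch of that benefit, with the set type repeated so the example compiles on its own. The wrapper class NonEmptyStringSetDemo is only a name made up for this illustration. Note the assumption that the HashSet(Collection) constructor inserts elements through add(); OpenJDK does this, but it is an implementation detail rather than a documented guarantee.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

final class NonEmptyStringSetDemo {

    // Same idea as the NonEmptyStringSet above, repeated so this compiles on its own.
    static class NonEmptyStringSet extends HashSet<String> {
        private static final long serialVersionUID = 1L;

        NonEmptyStringSet(Collection<String> source) {
            // Assumption: HashSet(Collection) adds elements via add(), as OpenJDK does.
            super(source);
        }

        @Override
        public boolean add(String s) {
            // Reject the empty string; everything else is handled by HashSet.add.
            return !"".equals(s) && super.add(s);
        }
    }

    public static void main(String[] args) {
        NonEmptyStringSet set = new NonEmptyStringSet(Arrays.asList("abc", "", "abc", "def"));
        System.out.println(set);            // contents after the filtered copy
        System.out.println(set.add(""));    // prints false: still rejected after construction
        System.out.println(set.add("ghi")); // prints true: a new non-empty string is accepted
    }
}
```

If relying on the constructor's internal use of add() feels too fragile, calling addAll() from your own constructor body would be one way around it.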

EDIT1:

Removing the "if(src.containsAll(dest)) return;" code introduces a "bug" when calling the method with source == dest: source ends up empty, because dest.clear() empties the very collection that is about to be copied. Example:

Collection<String> source = new ArrayList<String>();
source.add("abc");
copyStringCollectionAndRemoveDuplicates(source, source);
System.out.println(source);

EDIT2:

I did a small benchmark which shows that my implementation is about 30% faster than a simplified version of your initial implementation. This benchmark is an optimal case for your initial implementation because the dest collection is empty, so it doesn't have to clear it. Also take note that my implementation uses HashSet instead of LinkedHashSet, which makes my implementation a bit faster.

Benchmark code:

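import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SimpleBenchmark {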
public static void main(String[] args) {
    Collection<String> source = Arrays.asList("abc", "def", "", "def", "", 
            "jsfldsjdlf", "jlkdsf", "dsfjljka", "sdfa", "abc", "dsljkf", "dsjfl", 
            "js52fldsjdlf", "jladsf", "dsfjdfgljka", "sdf123a", "adfgbc", "dslj452kf", "dsjfafl", 
            "js21ldsjdlf", "jlkdsvbxf", "dsfjljk342a", "sdfdsa", "abxc", "dsljkfsf", "dsjflasd4" );

    int runCount = 1000000;
    long start1 = System.currentTimeMillis();
    for (int i = 0; i < runCount; i++) {
        copyStringCollectionAndRemoveDuplicates(source, new ArrayList<String>());
    }
    long time1 = (System.currentTimeMillis() - start1);
    System.out.println("Time 1: " + time1);


    long start2 = System.currentTimeMillis();
    for (int i = 0; i < runCount; i++) {
        new NonEmptyStringSet(source);
    }
    long time2 = (System.currentTimeMillis() - start2);
    System.out.println("Time 2: " + time2);

    long difference = time1 - time2;
    double percentage = (double)time2 / (double) time1;

    System.out.println("Difference: " + difference + " percentage: " + percentage);
}

public static class NonEmptyStringSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;

    public NonEmptyStringSet() {
    }

    public NonEmptyStringSet(Collection<String> source) {
        super(source);
    }

    @Override
    public boolean add(String o) {
        if ("".equals(o)) {
            return false;
        }
        return super.add(o);
    }
}

public static void copyStringCollectionAndRemoveDuplicates(
        Collection<String> src, Collection<String> dest) {
    Set<String> uniqueSet = new LinkedHashSet<String>(src.size());
    for (String f : src)
        if (!"".equals(f))
            uniqueSet.add(f);

    dest.addAll(uniqueSet);
}
}

Upvotes: 1

Stephen C

Reputation: 718718

I don't really think that I understand why you would want this method, but assuming that it is worthwhile, I would implement it as follows:

public static void copyStringCollectionAndRemoveDuplicates(
        Collection<String> src, Collection<String> dest) {
    if (src == dest) {
         throw new IllegalArgumentException("src == dest");
    }
    dest.clear();
    if (dest instanceof Set) {
        dest.addAll(src);
        dest.remove("");
    } else if (src instanceof Set) {
        for (String s : src) {
            if (!"".equals(s)) {
                dest.add(s);
            }
        }
    } else {
        HashSet<String> tmp = new HashSet<String>(src);
        tmp.remove("");
        dest.addAll(tmp);
    }
}

Notes:

This does not preserve the order of the elements in the src argument in all cases, but the method signature implies that this is irrelevant.
I deliberately don't check for null. It is a bug if a null is provided as an argument, and the correct thing to do is to allow a NullPointerException to be thrown.
Attempting to copy a collection to itself is also a bug.
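If element order did matter, one option would be to use a LinkedHashSet as the temporary set. The following is only a sketch under that assumption; the method name copyDistinctInOrder and the class OrderedCopySketch are made up for illustration. It keeps the first occurrence of each string in source order, provided dest itself is an ordered collection such as a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class OrderedCopySketch {

    // Hypothetical variant: same contract as above, but keeps first-occurrence order.
    static void copyDistinctInOrder(Collection<String> src, Collection<String> dest) {
        if (src == dest) {
            throw new IllegalArgumentException("src == dest");
        }
        // LinkedHashSet iterates in insertion order, unlike HashSet.
        Set<String> tmp = new LinkedHashSet<String>(src);
        tmp.remove("");
        dest.clear();
        dest.addAll(tmp);
    }

    public static void main(String[] args) {
        List<String> dest = new ArrayList<String>(Arrays.asList("old"));
        copyDistinctInOrder(Arrays.asList("c", "", "a", "c", "b"), dest);
        System.out.println(dest); // prints [c, a, b]
    }
}
```

The price is the extra linked list that LinkedHashSet maintains, so this is only worth it when callers actually depend on the order.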

Upvotes: 0

Andreas Dolk

Reputation: 114757

I'd say: Remove it! It's duplicate 'code', the Set is doing the same 'contains()' operation so there is no need to preprocess it here. Unless you have a huge input collection and a brilliant O(1) test for the containsAll() ;-)

The Set is fast enough. It has O(n) complexity in the size of the input (one contains() and (maybe) one add() operation for every String), and if the target.containsAll() test fails, contains() is done twice for each String, which is less performant.

EDIT

Some pseudo code to visualize my answer

void copy(source, dest) {
  bool:containsAll = true;
  foreach(String s in source) {  // iteration 1
    if (not s in dest) {         // contains() test
       containsAll=false
       break
    }
  }
  if (not containsAll) {
    foreach(String s in source) { // iteration 2
      if (not s in dest) {        // contains() test
        add s to dest
      }
    }
  }
}

If all source elements are in dest, then contains() is called once for each source element. If all but the last source elements are in dest (worst case), then contains() is called (2n-1) times (n=size of source collection). But the total number of contains() tests with the extra test is always equal to or greater than that of the same code without the extra test.

EDIT 2 Let's assume we have the following collections:

source = {"", "a", "b", "c", "c"}
dest = {"a", "b"}

First, the containsAll test fails, because the empty String in source is not in dest (this is a small design flaw in your code ;)). Then you create a temporary set which will be {"a", "b", "c"} (empty String and second "c" ignored). Finally you add everything to dest and, assuming dest is a simple ArrayList, the result is {"a", "b", "a", "b", "c"}. Is that the intention? A shorter alternative:

void copy(Collection<String> in, Collection<String> out) {
  Set<String> unique = new HashSet<String>(in);
  unique.remove("");  // drop the empty String from the copy, not from the input
  out.addAll(unique);
}

Upvotes: 7

Roman

Reputation: 66156

Too many confusing parameter names: dest and target have almost the same meaning. You'd better choose something like dest and source. It'll make things much clearer, even for you.
I have a feeling (I'm not sure it's correct) that you are using the collections API in the wrong way. The Collection interface doesn't say anything about the uniqueness of its elements, but you add this quality to it.
Modifying collections that are passed as parameters is not the best idea (but, as usual, it depends). In the general case, mutability is harmful and unnecessary. Moreover, what if the passed collections are unmodifiable/immutable? It's better to return a new collection than to modify the incoming collections.
The Collection interface has the methods addAll, removeAll and retainAll. Did you try them first? Have you made performance tests for code like:
```
Collection<String> result = new HashSet<String> (dest);
result.addAll (target);
```
or
```
target.removeAll (dest);
dest.addAll (target);
```
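
To illustrate the point about unmodifiable collections, here is a rough sketch rather than a definitive implementation. The helper distinctNonEmpty and the class ReturnNewCollectionSketch are hypothetical names. The helper never touches its argument, so it also works when the caller passes a read-only collection, whereas an in-place copy would fail as soon as it tried to clear or fill such a destination.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ReturnNewCollectionSketch {

    // Hypothetical helper: leaves the argument untouched and returns a fresh set.
    static Set<String> distinctNonEmpty(Collection<String> source) {
        Set<String> result = new LinkedHashSet<String>(source);
        result.remove("");
        return result;
    }

    public static void main(String[] args) {
        List<String> readOnly =
                Collections.unmodifiableList(Arrays.asList("a", "", "b", "a"));

        // Works even though the input cannot be modified.
        System.out.println(distinctNonEmpty(readOnly)); // prints [a, b]

        try {
            readOnly.clear(); // what an in-place copy would attempt on its destination
        } catch (UnsupportedOperationException e) {
            System.out.println("cannot modify an unmodifiable collection");
        }
    }
}
```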

Upvotes: 1

Daniel Engmann

Reputation: 2850

The containsAll() would not help if target has more elements than dest:
target: [a,b,c,d]
dest: [a,b,c]
target.containsAll(dest) is true, so dest is [a,b,c] but should be [a,b,c,d].
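
A small runnable sketch of this asymmetry, using the same two collections. The helper sameDistinctElements and the class ContainsAllDemo are made-up names, shown only to make clear what an actual "same elements" test would need to compare.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

final class ContainsAllDemo {

    // Hypothetical helper: true only if both collections hold the same distinct elements.
    static boolean sameDistinctElements(Collection<String> a, Collection<String> b) {
        return new HashSet<String>(a).equals(new HashSet<String>(b));
    }

    public static void main(String[] args) {
        List<String> target = Arrays.asList("a", "b", "c", "d");
        List<String> dest = Arrays.asList("a", "b", "c");

        System.out.println(target.containsAll(dest));           // prints true
        System.out.println(dest.containsAll(target));           // prints false
        System.out.println(sameDistinctElements(target, dest)); // prints false
    }
}
```

Building the two sets costs about as much as the copy itself, so such a check is unlikely to save any work.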

I think the following code is more elegant:

Set<String> uniqueSet = new LinkedHashSet<String>(target.size());
uniqueSet.addAll(target);
if(uniqueSet.contains(""))
    uniqueSet.remove("");

dest.addAll(uniqueSet);

Upvotes: 3

Sean Owen

Reputation: 66876

You could benchmark it, if it mattered that much. I think the call to containsAll() likely does not help, though it could depend on how often the two collections have the same contents.

But this code is confusing. It's trying to add new items to dest? So why does it clear it first? Just return your new uniqueSet to the caller instead of bothering with dest. And isn't your containsAll() check reversed?

Upvotes: 2
