panzerschreck

Reputation: 3642

Best datastructure for frequently queried list of objects

I have a list of objects, say List&lt;Entity&gt;. The Entity class has an equals method based on a few attributes (a business rule) that differentiate one Entity object from another.

The task we usually carry out on this list is to remove all the duplicates, something like this:

List<Entity> noDuplicates = new ArrayList<Entity>();
for (Entity entity : lstEntities) {
    int indexOf = noDuplicates.indexOf(entity);
    if (indexOf >= 0) {
        noDuplicates.get(indexOf).merge(entity);
    } else {
        noDuplicates.add(entity);
    }
}

Now, the problem I have been observing is that this part of the code slows down considerably as soon as the list has more than 10,000 objects. I understand ArrayList is doing an O(N) search.

Is there a faster alternative? Using a HashMap is not an option, because the entity's uniqueness is built upon four of its attributes together, and it would be tedious to put the key itself into the map. Will a sorted set help in faster querying?

Thanks

Upvotes: 5

Views: 631

Answers (6)

Tim Bender

Reputation: 20442

Two simple steps for an O(N log N) algorithm:

  1. Sort the list using a comparator based on the four important fields.
  2. Iterate over the list, comparing each item to the next; if they are equal, merge them and remove one.
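The two steps above might be sketched like this. The Entity class here is hypothetical, with a single key field standing in for the four identifying attributes and a count field to make merging visible:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical Entity: one key field stands in for the four
// identifying attributes; count exists only to show merging.
class Entity {
    final String key;
    int count;
    Entity(String key, int count) { this.key = key; this.count = count; }
    void merge(Entity other) { this.count += other.count; }
}

public class SortMerge {
    static List<Entity> dedupe(List<Entity> input) {
        List<Entity> sorted = new ArrayList<Entity>(input);
        // Step 1: sort by the identifying attributes - O(N log N)
        Collections.sort(sorted, new Comparator<Entity>() {
            public int compare(Entity a, Entity b) {
                return a.key.compareTo(b.key);
            }
        });
        // Step 2: one pass; duplicates are now adjacent, so merge
        // each element into the previous one when they match - O(N)
        List<Entity> result = new ArrayList<Entity>();
        for (Entity e : sorted) {
            if (!result.isEmpty()
                    && result.get(result.size() - 1).key.equals(e.key)) {
                result.get(result.size() - 1).merge(e);
            } else {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entity> in = new ArrayList<Entity>();
        in.add(new Entity("a", 1));
        in.add(new Entity("b", 2));
        in.add(new Entity("a", 3));
        List<Entity> out = dedupe(in);
        System.out.println(out.size());        // 2
        System.out.println(out.get(0).count);  // 4 (the two "a" entries merged)
    }
}
```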

Upvotes: 0

CPerkins

Reputation: 9008

Unless you have a reason for needing the ordering of a List, you'd probably be best off with a Set - specifically, a HashSet.

I see your concern about using a hashed collection because "the entity's uniqueness is built upon 4 of its attributes together", but that's easily overcome. You just have to define a hashCode() method which is compatible with your existing equals() method, and then you can insert your entities into a Set, and as a magic side effect, never have to remove duplicates again.
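A minimal sketch of that, with one hypothetical key field standing in for the four attributes (note that a plain Set silently drops duplicates rather than merging them, so this fits if you only need deduplication):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical Entity: equals() and hashCode() must agree, i.e. be
// derived from the same identifying attributes.
class Entity {
    final String key;
    Entity(String key) { this.key = key; }
    @Override public boolean equals(Object o) {
        return o instanceof Entity && ((Entity) o).key.equals(key);
    }
    @Override public int hashCode() {
        return key.hashCode();
    }
}

public class SetDedup {
    public static void main(String[] args) {
        Set<Entity> entities = new HashSet<Entity>();
        entities.add(new Entity("a"));
        entities.add(new Entity("a"));  // duplicate is silently ignored
        entities.add(new Entity("b"));
        System.out.println(entities.size());  // 2
    }
}
```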

Upvotes: 0

matt b

Reputation: 139921

Now, the problem I have been observing is that this part of the code slows down considerably as soon as the list has more than 10,000 objects. I understand ArrayList is doing an O(N) search.

The algorithm you posted is actually worse than O(N):

  • Iterating through the input list lstEntities - O(N)
  • within this loop, you are calling ArrayList.indexOf(T) which has to scan the list - O(N) again

Your algorithm is actually O(N^2), since you are potentially scanning the entire list on every iteration of the loop.

It sounds like what you want to do is actually two operations:

  1. From the input List, remove any duplicates
  2. When you find duplicates, "merge" the entities.

You can do this by scanning the list just once, rather than in nested loops. I would recommend breaking up your Entity to move the fields that "identify" an Entity into another type, such as ID, or at the very least add a getID() method which can return these fields grouped into a single type. This way you can easily build a Map between the two types to be able to merge entities with "duplicate" identities. This might look something like this:

Map<ID, Entity> map = new HashMap<ID, Entity>(inputList.size());
for (Entity e : inputList) {
    Entity existing = map.get(e.getID());
    if (existing == null) {
        // not in map, add it
        map.put(e.getID(), e);
    } else {
        existing.merge(e);
    }
}

Iterating through the list is O(n) while HashMap.get(K) is a constant-time operation.
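A minimal sketch of the ID type this approach assumes, with one hypothetical field standing in for the four attributes. The essential point is that ID must define both equals() and hashCode() over those fields, or the HashMap lookups won't find existing entries:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ID type grouping the identifying attributes
// (shown with one field for brevity).
final class ID {
    final String name;
    ID(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof ID && ((ID) o).name.equals(name);
    }
    @Override public int hashCode() { return name.hashCode(); }
}

// Hypothetical Entity exposing its identity via getID().
class Entity {
    private final ID id;
    int count;
    Entity(ID id, int count) { this.id = id; this.count = count; }
    ID getID() { return id; }
    void merge(Entity other) { this.count += other.count; }
}

public class IdDemo {
    public static void main(String[] args) {
        Map<ID, Entity> map = new HashMap<ID, Entity>();
        Entity first = new Entity(new ID("a"), 1);
        map.put(first.getID(), first);
        // A second entity with an equal ID hashes to the same bucket,
        // so the lookup finds the existing entry and we merge into it.
        Entity duplicate = new Entity(new ID("a"), 2);
        map.get(duplicate.getID()).merge(duplicate);
        System.out.println(map.get(new ID("a")).count);  // 3
    }
}
```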

Upvotes: 2

Lars Andren

Reputation: 8771

One idea is to use a Set instead of a List; there are no duplicates in a Set. To remove duplicates from a list, you could just add the List to a new Set:

List<Entity> list = // your list
Set<Entity> set = new HashSet<Entity>();
set.addAll(list);

But then again, maybe there is some reason for using a List in the first place? If not, you could use a Set instead, and not have to worry about any duplicates.

EDIT

There is no index reference of the elements in a Set (as compared to a List, where you can do get(int index)). The elements in a Set are floating around without a specific point of reference.

If you need to find a specific one, you have to iterate through them all. If that is not acceptable, or you can't do without the indexed access that get(int index) and remove(int index) provide, I guess a Set is not an option for you.

Upvotes: 2

Daniel Martin

Reputation: 23548

It all depends on what that merge operation is doing. Does merge change any of the attributes that are compared when you do equals? If not, then you'll be amazed at how much faster it will be if you do this:

First, define a hashCode for your Entity class that is compatible with your definition of equals. One common way to do this is:

public int hashCode() {
  // assuming the four attributes that determine equality are called
  // attrFoo, attrBar, attrBaz, and attrQux
  int hash = 1;
  hash += attrFoo == null ? 0 : attrFoo.hashCode();
  hash *= 37;
  hash += attrBar == null ? 0 : attrBar.hashCode();
  hash *= 37;
  hash += attrBaz == null ? 0 : attrBaz.hashCode();
  hash *= 37;
  hash += attrQux == null ? 0 : attrQux.hashCode();

  return hash;
}

Then, use a HashMap so that you can find these things:

Map<Entity, Entity> map = new HashMap<Entity, Entity>();
for(Entity entity: lstEntities) {
  if (map.containsKey(entity)) {
    map.get(entity).merge(entity);
  } else {
    map.put(entity, entity);
  }
}
return map.values();  // or keySet(). Whichever.

I should note that I feel a bit dirty writing the above code, because you really shouldn't use mutable objects as Map keys, but this will work and be much, much faster than what you're doing now.

Upvotes: 1

user206428

Reputation:

Instead of a list structure, you could use a set (more appropriate if you are concerned about Entity uniqueness), as Lars has suggested. Additionally, if performance is a problem, I'd look at using a TreeSet and implement a Comparator for comparing Entity instances based on their attributes. The tree structure will allow for fast (logarithmic complexity) insert, delete and retrieval operations.
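A minimal sketch of this approach, with a hypothetical Entity and a comparator chaining its attributes. Two caveats apply: the comparator should be consistent with equals (TreeSet treats compare() == 0 as a duplicate), and a plain Set discards duplicates rather than merging them:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical Entity: two attributes stand in for the four
// that define identity in the question.
class Entity {
    final String name;
    final int version;
    Entity(String name, int version) { this.name = name; this.version = version; }
}

public class TreeSetDemo {
    // Compare each identifying attribute in turn; TreeSet uses this
    // ordering (not equals/hashCode) to detect duplicates.
    static final Comparator<Entity> BY_IDENTITY = new Comparator<Entity>() {
        public int compare(Entity a, Entity b) {
            int c = a.name.compareTo(b.name);
            if (c != 0) return c;
            return a.version - b.version;
        }
    };

    public static void main(String[] args) {
        Set<Entity> set = new TreeSet<Entity>(BY_IDENTITY);
        set.add(new Entity("a", 1));
        set.add(new Entity("a", 1));  // compare() == 0, so it is dropped
        set.add(new Entity("b", 2));
        System.out.println(set.size());  // 2
    }
}
```

Each add, remove, and contains call then costs O(log N) against the red-black tree, versus the O(N) indexOf scan in the question.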

Upvotes: 3
