Adamski

Reputation: 54697

Efficient look-up in a List

I have a situation whereby I'm populating an ArrayList with "TransactionEvent"s. TransactionEvent has a property "transaction ID". In the large majority of cases each new event has a transaction ID greater than the previous event's ID; however, this is not guaranteed, i.e. the data is almost sorted.

My question is this: How can I perform fast look-ups based on transaction ID? My current idea is to call Collections.binarySearch(...) and, if this fails, perform a linear search. However, I notice the Javadoc states that the result of binarySearch is undefined if the data is unordered, so I may have to roll my own implementation.
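For concreteness, the fallback idea described above can be sketched like this (a minimal sketch using plain Integer IDs in place of TransactionEvent; the class and method names are illustrative, not from any answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AlmostSortedLookup {
    // Try a binary search first; on almost-sorted data its result is
    // unreliable, so fall back to a linear scan when it misses.
    static int lookup(List<Integer> ids, int target) {
        int i = Collections.binarySearch(ids, target); // undefined on unsorted data
        if (i >= 0) {
            return i;               // the probe happened to land on the target
        }
        return ids.indexOf(target); // linear fallback, -1 if truly absent
    }
}
```

Note that a non-negative result from binarySearch always points at an element that compared equal to the key, even on unsorted data, so only the negative case needs the linear fallback.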


Any help appreciated.

Upvotes: 5

Views: 3894

Answers (11)

Justin

Reputation: 4477

I've cleaned things up a bit from my previous post. @Lizzard, your solution is best given the property that new entries are usually at the end. The solution below should perform better if you have random arrivals at the cost of more memory for the maps. It also lets you defer your array insertion (potentially O(n) worst case) until you actually need to draw the cell for a row below the earliest insertion point.

// sorted events (using natural ordering on eventID)
SortedSet<Event> model = new TreeSet<Event>();
ArrayList<Event> sortedList = new ArrayList<Event>();
Event lowestAddition, additionPrevEntry; // low-water mark for insertions

public void insert(Event x) {
 if (lowestAddition == null || x.compareTo(lowestAddition) < 0) {
  SortedSet<Event> headSet = model.headSet(x); // elements strictly below x
  additionPrevEntry = headSet.isEmpty() ? null : headSet.last();
  lowestAddition = x;
 }

 model.add(x);  // add
}

public void materialize() {
 // everything from the previous entry onwards needs renumbering
 SortedSet<Event> tailSet =
   (additionPrevEntry == null) ? model : model.tailSet(additionPrevEntry);

 Event firstValue = tailSet.first();  // this element does not change its order
 int order = firstValue.getOrder();   // keep order on Event
 for (Event x : tailSet) {
  x.setOrder(order);
  if (order < sortedList.size()) {
   sortedList.set(order, x);
  } else {
   sortedList.add(x);
  }
  order++;
 }

 lowestAddition = null;
 additionPrevEntry = null;
}

Here is what your Swing code looks like; I assume you are using Swing since you want a table model:

// now your model code uses the array
public Object getValueAt(int row, int col) {
 return getColumn(sortedList.get(row), col);
}

// you can gain significant performance by deferring
// materialization until you actually need it
public class DeferredJTable extends JTable {
 public void paintComponent(Graphics g) {
  // if you knew what rows in the table were being drawn
  // ahead of time, you could defer further
  materialize();

  super.paintComponent(g);
 }
}

Upvotes: 0

Imagist

Reputation: 18514

My first answer wasn't really what you were looking for. Now that I understand the problem better, give this a try. I only implemented the key parts. This will be a little bit more memory intensive, but since I'm pretty sure the ArrayList stores the references, not the objects themselves, the memory difference shouldn't be too huge in comparison with the actual object storage.

class TransactionEventStore
{
    private ArrayList<TransactionEvent> byOrder = new ArrayList<TransactionEvent>(),
                                        byId    = new ArrayList<TransactionEvent>();

    private void insertByOrder(TransactionEvent e) { this.byOrder.add(e); }

    private void insertById(TransactionEvent e)
    {
        // scan back from the end; cheap because arrivals are almost sorted
        int i = this.byId.size();
        while(i > 0 && e.getId() < this.byId.get(i - 1).getId())
            i--;
        this.byId.add(i, e);
    }

    public void insert(TransactionEvent e)
    {
        this.insertByOrder(e);
        this.insertById(e);
    }
}

Now when you need to look up by insertion order, use this.byOrder, and when you need to look up by id, use this.byId.

Upvotes: 0

Stephan Eggermont

Reputation: 15907

ArrayList is for toy-sized problems, and 100,000 rows is getting a little out of toy space. That means you have to be more precise about the access patterns you need to support. A sorted ArrayList might be enough, and if processing speed grows faster than your problem size you might not need to bother, but a BTree will be faster at 100K elements.

ArrayList has the following problems with larger problem sizes:

  • add to the end is slow when the collection has to grow (copy all elements)
  • insert at a random position is slow because on average half the collection has to be moved one position

A two-level collection with a fixed page size (e.g. a BTree) can help because growing means adding one (ideally about sqrt(size)-sized) page, and a random insert will at most split one page in two.

With two needed sort orders, you can simply use two (sorted) BTrees.

[edit] The answer to the earlier question is the key to the problem. For a 1,000-element ArrayList an insert costs about 7 microseconds; for 1,000,000 elements, about 7 milliseconds. A BTree stays in the microsecond range (though it could be twice as slow with a 1000-element page size).

Indexed access can be provided by keeping an index of the number of elements in each page. If you set a dirty flag on each page, you can use a background thread to update the start index of each page, or you can add bulk operations with delayed index building.

The index can be invalid, but it is only sqrt(size) large. For 100K elements, it means incrementing about 150 indexes on average. That takes microseconds, not milliseconds.
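The two-level idea above can be sketched as a toy paged list (a sketch only, not a real BTree; PAGE_SIZE and all names here are mine, and elements are plain ints for brevity). A random insert only shifts elements within one page, and a full page is split in two:

```java
import java.util.ArrayList;
import java.util.List;

public class PagedList {
    private static final int PAGE_SIZE = 4;   // tiny, for illustration only
    private final List<List<Integer>> pages = new ArrayList<>();

    public void insert(int index, int value) {
        int p = 0;
        // walk the page directory to find the target page
        while (p < pages.size() && index > pages.get(p).size()) {
            index -= pages.get(p).size();
            p++;
        }
        if (p == pages.size()) pages.add(new ArrayList<>());
        List<Integer> page = pages.get(p);
        page.add(index, value);               // shifts only within this page
        if (page.size() > PAGE_SIZE) {        // split one full page in two
            List<Integer> right = new ArrayList<>(page.subList(PAGE_SIZE / 2, page.size()));
            page.subList(PAGE_SIZE / 2, page.size()).clear();
            pages.add(p + 1, right);
        }
    }

    public int get(int index) {
        for (List<Integer> page : pages) {
            if (index < page.size()) return page.get(index);
            index -= page.size();
        }
        throw new IndexOutOfBoundsException();
    }

    public int size() {
        int n = 0;
        for (List<Integer> page : pages) n += page.size();
        return n;
    }
}
```

A production version would keep the per-page counts in a separate index (as described above) rather than scanning pages on every get().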

Upvotes: 1

Justin

Reputation: 4477

Why not just use a sorted collection as your table model instead of a list? A TreeMap seems logical since your entries are all ordered. If you also need fast access by row or any other column, you can simply add a secondary map. Basically you are doing what database indexes do.

I thought for some reason that you could use the map.headSet(key) and find the kth entry -- this won't work. You need to be able to get from table row -> EventID (or close to it).

If you use a model like this:

SortedMap<EventID, Event> model = new TreeMap<EventID, Event>();

Conceptually your getValueAt() looks like this:

Object getValueAt(int row, int column) {
 EventID eventID = getSortPosition(row);
 Event e = model.get(eventID);
 return getColumn(e, column);
}

The key is being able to efficiently maintain a map from sort index -> key (an inverse map). This is non-trivial since inserting a new event at the very top affects the absolute order of all those below it. It seems like there should be a CS answer here, but it escapes me.

Here is the most basic implementation: on every insert, you update your map, then re-materialize your sorted list.

ArrayList<Event> orderedEvents = new ArrayList<Event>();

public void insert(Event event) {
 model.put(event.getID(), event);

 // rebuild the ordered list from the sorted map
 orderedEvents.clear();
 orderedEvents.addAll(model.values());
}

Your getValueAt() would be pretty simple.

Object getValueAt(int row, int column) {
 Event e = orderedEvents.get(row);
 return getColumn(e, column);
}
  • this makes each insertion O(n) instead of O(log n) (still not great)

I think you should reconsider your UI design. If you are having users browse a 100K-row table, adding a search filter will solve your performance problem:

  • No user will EVER read 100K rows
  • If it is meaningful for your users to search by eventID then this works great: when the user selects the eventID, you do sortedMap.headSet( searchFilterID ) // take the first 200 and put them into your table
  • If it is meaningful for users to search by time, then make a map keyed by time and do the same.

Upvotes: 0

Imagist

Reputation: 18514

My vote is that you insert into the list in order. Then you can do a binary search. A few notes:

  1. This will be faster than normal insertions because inserting into an ArrayList near the end is faster than inserting near the beginning (fewer elements need to be moved) and most of your insertions will be at or near the end (because they are almost-ordered).
  2. Normally you would find the insertion point for inserting into an ArrayList using a binary search algorithm. In this case, it's faster to search linearly, starting from the end, since most of your insertions will occur at or near the end.
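The two points above might be sketched like this (plain Integer IDs stand in for TransactionEvent; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OrderedInsert {
    // Find the insertion point by scanning linearly from the end
    // (cheap for almost-sorted arrivals), then insert there so the
    // list stays sorted and binary-searchable.
    static void insertSorted(List<Integer> ids, int id) {
        int i = ids.size();
        while (i > 0 && ids.get(i - 1) > id) {
            i--;                // walk back from the end
        }
        ids.add(i, id);         // keeps the list sorted
    }
}
```

Once the list is maintained this way, Collections.binarySearch gives valid results, since its sorted-input precondition holds.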

Upvotes: 0

Eugene Ryzhikov

Reputation: 17359

I had the same problem. The solution I came up with is a custom collection based on ArrayList which also includes a Map of all the elements. This is not hard to do. If you'd like me to post the source code, let me know.

Upvotes: 0

James

Reputation: 1041

I would use a binary search to get an approximate location of the id and then search outwards linearly. The downside is that if the id you're searching for isn't in the list, it will take O(log n + n), i.e. linear time.

Binary searches are very easy to implement, and I recommend reading the Wikipedia article.
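A minimal sketch of this approach (plain Integer IDs, illustrative names; the outward scan alternates one step left and one step right of the binary search's guess):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OutwardSearch {
    static int lookup(List<Integer> ids, int target) {
        int guess = Collections.binarySearch(ids, target);
        if (guess >= 0) return guess;               // lucky direct hit
        guess = -(guess + 1);                       // approximate position
        for (int d = 0; d <= ids.size(); d++) {     // expand outwards
            int lo = guess - d, hi = guess + d;
            if (lo >= 0 && lo < ids.size() && ids.get(lo) == target) return lo;
            if (hi >= 0 && hi < ids.size() && ids.get(hi) == target) return hi;
        }
        return -1;  // not present: the scan cost O(log n + n)
    }
}
```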

Upvotes: 0

Winston Smith

Reputation: 21884

From what you have said, it looks like fast look-ups are the most important thing here.

So perhaps you should be using a HashMap instead of an ArrayList. In the HashMap, store your TransactionEvents using the TransactionID as the Key. Lookups in a HashMap are O(1).

Note that adding to the HashMap can become quite slow if you exceed its initial capacity, since it has to re-hash. If you can, try to initialise it with a best guess (err on the high side) at the number of items it will hold.
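A sketch of this suggestion (the 150,000 capacity guess, the String event payload, and all names are assumptions for illustration, not from the question):

```java
import java.util.HashMap;
import java.util.Map;

public class EventIndex {
    // pre-sized above the expected 100K rows to avoid re-hashing
    private final Map<Long, String> byId = new HashMap<>(150_000);

    public void add(long transactionId, String event) {
        byId.put(transactionId, event);
    }

    public String lookup(long transactionId) {
        return byId.get(transactionId);  // O(1) expected
    }
}
```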

With 100k rows, you might have to increase your java heap size to prevent OutOfMemoryErrors.

java -Xms<initial heap size> -Xmx<maximum heap size>

Defaults are:

java -Xms32m -Xmx128m

EDIT:

If ordering really is important you could use a SortedMap.

Upvotes: 0

James

Reputation: 2066

You could keep your list sorted. If you insertion-sort as you add items, and the items to be added are almost sorted, then insertions will still run in effectively constant time. This would then allow you to binary search in logarithmic time.

Upvotes: 0

Jon

Reputation: 5357

Using a LinkedHashMap, which combines a doubly-linked list with hash access, you should be able to interface with the TableModel as you do with an ArrayList, but also access the entries via a hash lookup on TransactionID.

You can even replace (e.g. update) based on a key without affecting the iteration order.
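A minimal sketch of this idea (a String payload stands in for TransactionEvent; names are illustrative): iteration follows insertion order, which is what the table model wants, while get() is a hash lookup by ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedStore {
    private final Map<Integer, String> events = new LinkedHashMap<>();

    public void add(int id, String event) { events.put(id, event); }

    public String byId(int id) { return events.get(id); }  // hash lookup

    // values() iterates in insertion order, regardless of key values
    public Iterable<String> inInsertionOrder() { return events.values(); }
}
```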

Upvotes: 3

Bill the Lizard

Reputation: 405715

You could keep the ArrayList sorted by searching for the insertion point as you add each TransactionEvent. Collections.binarySearch returns

index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(), if all elements in the list are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

Once you search for the insertion point you can use the ArrayList add(int index, Object element) method instead of just adding to the end of the list as you would normally. This will slow down each insertion by a small factor, but it will enable you to use binary search for fast look-up.
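A sketch of this approach (plain Integer IDs for brevity; the names are illustrative): decode binarySearch's (-(insertion point) - 1) return value and insert there, keeping the list sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedAdd {
    static void add(List<Integer> ids, int id) {
        int i = Collections.binarySearch(ids, id);
        if (i < 0) i = -(i + 1);  // convert to the insertion point
        ids.add(i, id);           // list stays sorted; look-ups stay O(log n)
    }
}
```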

Upvotes: 3
