Radu
Radu

Reputation: 1104

Time based grouping

I have the following problem:

I have a list of objects(containing a date property) which is of size S(usually S will get to about 200k, but there might be cases where it might go up to a million).

I need to filter the list based on the following condition:

Given the large list and what performance issues might arise from an inefficient solution, can you please advise me what would be the best possible approach/algorithm for this problem.

Thanks.

PS:The implementation is in JAVA

Upvotes: 2

Views: 1484

Answers (6)

OldCurmudgeon
OldCurmudgeon

Reputation: 65811

Walk a sub-list of length X down the full list.

At each iteration:

If sublist.tail.date - sublist.head.date > T then mark sublist.head for discard.

Unfortunately you may end up discarding some of the qualifying events but that can be fixed.

Upvotes: 1

Chris Dennett
Chris Dennett

Reputation: 22721

I'd advise using Esper, which is a complex event processor. It uses SQL-like queries to manipulate incoming events, and you can scan existing events in the timeline. Let that handle the data structures.

Upvotes: 1

Adriaan Koster
Adriaan Koster

Reputation: 16209

You could 'stream' the objects (I called them events) through a monitor which flags the events with the desired property.

For example:

public class EventMonitor {

    private int minimumGroupSize;
    private long window;

    private LinkedList<Event> events = new LinkedList<Event>();

    public EventMonitor(int minimumGroupSize, long window) {
        this.minimumGroupSize = minimumGroupSize;
        this.window = window;
    }

    public void handle(Event newest) {

        System.out.println(newest);
        events.addLast(newest);
        if (events.size() == minimumGroupSize) {
            Event oldest = events.peekFirst();
            if (newest.getTimestamp() - oldest.getTimestamp() < window) {
                System.out.println("Group starter: " + oldest);
            }
            events.removeFirst();
        }
    }

    public static class Event {
        private final long timestamp;

        Event(long timestamp) {
            this.timestamp = timestamp;
        }

        public long getTimestamp() {
            return timestamp;
        }

        public String toString() {
            return String.valueOf(timestamp);
        }
    }

    public static void main(String[] args) {

        EventMonitor monitor = new EventMonitor(5, 15);
        feedEventData(monitor);
    }

    private static void feedEventData(EventMonitor monitor) {
        long timestamp = 0;
        for (int i = 0; i < 20; i++) {

            long interval = 1 + (long) (Math.random() * 10);
            timestamp = timestamp + interval;
            monitor.handle(new Event(timestamp));
        }
    }
}

This feeds Events with an interval of 1-10 to the EventMonitor. The monitor keeps track of the most recent minimumGroupSize number of events and prints out the oldest event if the newest event falls within the time window.

NOTE: this example implementation is not threadsafe

Upvotes: 1

G_H
G_H

Reputation: 11999

I could be totally wrong here, but I think people missed the fact that your title used the word "grouping" instead of "filtering" and the part that said "starting from its date property". So here's my take on your question:

  • You have a source list L, containing items of type I that has a timestamp property t.
  • You wish to construct a target list T containing items from L (perhaps in the same order).
  • You have some radius r, expressed as a time unit (milliseconds, seconds, minutes...).
  • You have some positive integer threshold value l (running out of letters here).
  • An item i from L should be placed in T if and only if there are l or more items in L with a timestamp that falls within interval [i.t - r, i.t + r].

Iterating over the list and counting that number of items within the interval for each entry would be very expensive, assuming the list isn't sorted. This would lead to an O(n²) performance.

If the order of the items in T may differ from the items in L, then the following is an effective solution:

  1. Sort the items in L. If the item type I implements the Comparable interface and the ordering happens by timestamp, then it'll be simple. Otherwise, you might need to implement a Comparator that sorts on the timestamp property. Sorting cannot perform better than O(n*log n), so that's the bottom line for now.

  2. Make three indexes: s (for start), e (for end), c (for current item).

  3. Iterate over your sorted list with index c. For each entry, do the following:

    3.1 Starting from index s, count how many items no longer fall within the "radius". That is, how many items have a timestamp that is lower than the timestamp of the item at index c minus the time radius. Add that value to index s. This makes it point at the "earliest" item that falls within radius.

    3.2 Starting from index e, check items in the sorted list to see if they fall within the radius. Let e point at the last item that does.

    3.3 Take value (e - s). If it falls above threshold l, that means the item at index c passes the filter. Otherwise, it doesn't.

    3.4 Increment c. If c is now > e, increment e as well.

That's it. Step 3 checks each item in the sorted list once so that's O(n) performance. The worst-case scenario depends on the timestamps and the radius. You don't start searching for the lower limit from the start of the list each time, but from a trailing index s. You don't start searching for the upper limit from c each time, but from a leading index e. So I believe the worst-case performance is still O(n) since s and e can only traverse the list once as well. Suppose you have 100 entries and from entry 3 you find that all remaining entries fall within radius (that is, e becomes 99), you'll never have to check further.

With one O(n*log n) step and one O(n) step, the amortized performance becomes O(n*log n). Seems quite acceptable.

Mind that you'll need additional steps if the filtered list must maintain the original item order. In that case some kind of index list could be of use.

EDIT: I just realized you may have literally meant "starting from", so if that's the case, simply ignore the trailing index s and only work with leading index e. The algorithm remains the same otherwise, apart from taking (e - c) instead of (e - s). I also had some "frame list" in the pre-edit version which was obviously nonsense since the indexes are enough to calculate the number needed.

Upvotes: 2

sysoutnull
sysoutnull

Reputation: 124

You'd need a:

class Foo { Date date; } 
class Event { Period period; }
class Period { Date start, end; }

List<Foo> foos = ...
List<Event> events = ...
Period p = new Period(...)

[pseudo-code]
foreach foo in foos: 
    eventsAfterFoo = findEventsAfter(foo.date, events);
    c = 0;
    foreach event in eventsAfterFoo:
        if(event.isInPeriod(p)) c++
    if(c >= X) 
        finalList.add(foo) 

For a large number of elements you'd certainly simplify your solution using a database, even a non mainstream one like HSQL.

You could certainly split the lists into different VMs on different machines, the only shared/readonly list would be 'events'.

Upvotes: 1

Thomas
Thomas

Reputation: 88707

I don't think I fully understand your problem, but if the list is sorted by date you should be able to determine whether the following X items are all in the interval T, otherwise delete them.

If X gets high you might consider using a different sorted structure, e.g. a TreeMap where the key is the date. Then you could do something like this:

SortedMap<Date, YourObject> map = ...;//left for you

int numEventsInInterval = map.subMap( intervalStart, intervalEnd ).size();

Upvotes: 1

Related Questions