Reputation: 1104
I have the following problem:
I have a list of objects(containing a date property) which is of size S(usually S will get to about 200k, but there might be cases where it might go up to a million).
I need to filter the list based on the following condition:
Given the large list and what performance issues might arise from an inefficient solution, can you please advise me what would be the best possible approach/algorithm for this problem.
Thanks.
PS:The implementation is in JAVA
Upvotes: 2
Views: 1484
Reputation: 65811
Walk a sub-list of length X down the full list.
At each iteration:
If sublist.tail.date - sublist.head.date > T then mark sublist.head for discard.
Unfortunately you may end up discarding some of the qualifying events but that can be fixed.
Upvotes: 1
Reputation: 22721
I'd advise using Esper, which is a complex event processor. It uses SQL-like queries to manipulate incoming events, and you can scan existing events in the timeline. Let that handle the data structures.
Upvotes: 1
Reputation: 16209
You could 'stream' the objects (I called them events) through a monitor which flags the events with the desired property.
For example:
public class EventMonitor {
private int minimumGroupSize;
private long window;
private LinkedList<Event> events = new LinkedList<Event>();
public EventMonitor(int minimumGroupSize, long window) {
this.minimumGroupSize = minimumGroupSize;
this.window = window;
}
public void handle(Event newest) {
System.out.println(newest);
events.addLast(newest);
if (events.size() == minimumGroupSize) {
Event oldest = events.peekFirst();
if (newest.getTimestamp() - oldest.getTimestamp() < window) {
System.out.println("Group starter: " + oldest);
}
events.removeFirst();
}
}
public static class Event {
private final long timestamp;
Event(long timestamp) {
this.timestamp = timestamp;
}
public long getTimestamp() {
return timestamp;
}
public String toString() {
return String.valueOf(timestamp);
}
}
public static void main(String[] args) {
EventMonitor monitor = new EventMonitor(5, 15);
feedEventData(monitor);
}
private static void feedEventData(EventMonitor monitor) {
long timestamp = 0;
for (int i = 0; i < 20; i++) {
long interval = 1 + (long) (Math.random() * 10);
timestamp = timestamp + interval;
monitor.handle(new Event(timestamp));
}
}
}
This feeds Events with an interval of 1-10 to the EventMonitor. The monitor keeps track of the most recent minimumGroupSize number of events and prints out the oldest event if the newest event falls within the time window.
NOTE: this example implementation is not threadsafe
Upvotes: 1
Reputation: 11999
I could be totally wrong here, but I think people missed the fact that your title used the word "grouping" instead of "filtering" and the part that said "starting from its date property". So here's my take on your question:
Iterating over the list and counting that number of items within the interval for each entry would be very expensive, assuming the list isn't sorted. This would lead to an O(n²) performance.
If the order of the items in T may differ from the items in L, then the following is an effective solution:
Sort the items in L. If the item type I implements the Comparable
interface and the ordering happens by timestamp, then it'll be simple. Otherwise, you might need to implement a Comparator
that sorts on the timestamp property. Sorting cannot perform better than O(n*log n), so that's the bottom line for now.
Make three indexes: s (for start), e (for end), c (for current item).
Iterate over your sorted list with index c. For each entry, do the following:
3.1 Starting from index s, count how many items no longer fall within the "radius". That is, how many items have a timestamp that is lower than the timestamp of the item at index c minus the time radius. Add that value to index s. This makes it point at the "earliest" item that falls within radius.
3.2 Starting from index e, check items in the sorted list to see if they fall within the radius. Let e point at the last item that does.
3.3 Take value (e - s). If it falls above threshold l, that means the item at index c passes the filter. Otherwise, it doesn't.
3.4 Increment c. If c is now > e, increment e as well.
That's it. Step 3 checks each item in the sorted list once so that's O(n) performance. The worst-case scenario depends on the timestamps and the radius. You don't start searching for the lower limit from the start of the list each time, but from a trailing index s. You don't start searching for the upper limit from c each time, but from a leading index e. So I believe the worst-case performance is still O(n) since s and e can only traverse the list once as well. Suppose you have 100 entries and from entry 3 you find that all remaining entries fall within radius (that is, e becomes 99), you'll never have to check further.
With one O(n*log n) step and one O(n) step, the amortized performance becomes O(n*log n). Seems quite acceptable.
Mind that you'll need additional steps if the filtered list must maintain the original item order. In that case some kind of index list could be of use.
EDIT: I just realized you may have literally meant "starting from", so if that's the case, simply ignore the trailing index s and only work with leading index e. The algorithm remains the same otherwise, apart from taking (e - c) instead of (e - s). I also had some "frame list" in the pre-edit version which was obviously nonsense since the indexes are enough to calculate the number needed.
Upvotes: 2
Reputation: 124
You'd need a:
class Foo { Date date; }
class Event { Period period; }
class Period { Date start, end; }
List<Foo> foos = ...
List<Event> events = ...
Period p = new Period(...)
[pseudo-code]
foreach foo in foos:
eventsAfterFoo = findEventsAfter(foo.date, events);
c = 0;
foreach event in eventsAfterFoo:
if(event.isInPeriod(p)) c++
if(c >= X)
finalList.add(foo)
For a large number of elements you'd certainly simplify your solution using a database, even a non mainstream one like HSQL.
You could certainly split the lists into different VMs on different machines, the only shared/readonly list would be 'events'.
Upvotes: 1
Reputation: 88707
I don't think I fully understand your problem, but if the list is sorted by date you should be able to determine whether the following X items are all in the interval T, otherwise delete them.
If X gets high you might consider using a different sorted structure, e.g. a TreeMap
where the key is the date. Then you could do something like this:
SortedMap<Date, YourObject> map = ...;//left for you
int numEventsInInterval = map.subMap( intervalStart, intervalEnd ).size();
Upvotes: 1