R - Cluster x number of events within y time period

Question

I have a dataset of 59k entries recorded over 63 years. I need to identify clusters of events that meet this criterion:

6 or more events within 6 hours

Each event has a unique ID, a time (HH:MM:SS) and a date (DD:MM:YY). Ideally the output would have a cluster ID, the events that took place within each cluster, and the start and finish time and date of the cluster.

My thinking is that in R we would need to look at every date/time and count the number of events in the following 6 hours. If the number is 6 or greater, save the event IDs; if not, move on to the next date and perform the same task. I have taken a data extract that contains just EventID, Date, Time and Year.

https://dl.dropboxusercontent.com/u/16400709/StackOverflow/DataStack.csv

If I come up with anything in the meantime I will post below.

Update: Having taken a break to think about the problem, I have a new approach.

Add 6 hours to the date/time of each event, then count the number of events that fall between that start and end time. If there are 6 or more, take the event IDs and assign them a cluster ID. Then move on to the next event and repeat, 59k times, as a loop.

Has QUIT--Anony-Mousse · Accepted Answer

Don't use clustering. It's the wrong tool. And the wrong term. You are not looking for abstract "clusters", but for something much simpler and much better defined. In particular, your data is one-dimensional, which makes things a lot easier than the multivariate case omnipresent in clustering.

Instead, sort your data and use a sliding window.

If your data is sorted and time[x+5] - time[x] < 6 hours, then the six events x through x+5 satisfy your condition.
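A minimal sketch of that check in R, on hypothetical toy data. The column name `when`, the variable names and the timestamps are assumptions for illustration, not taken from the linked file. The strict `<` mirrors the condition above; use `<=` if a span of exactly 6 hours should count.

```r
# Hypothetical toy data: six events one hour apart, then two stragglers.
events <- data.frame(
  EventID = 1:8,
  when = as.POSIXct(c(
    "2000-01-01 00:00:00", "2000-01-01 01:00:00", "2000-01-01 02:00:00",
    "2000-01-01 03:00:00", "2000-01-01 04:00:00", "2000-01-01 05:00:00",
    "2000-01-01 09:00:00", "2000-01-03 12:00:00"
  ), tz = "UTC")
)

events <- events[order(events$when), ]  # sort by time first
secs <- as.numeric(events$when)         # seconds since the epoch
n <- length(secs)
k <- 6                                  # minimum number of events
max_gap <- 6 * 3600                     # 6 hours, in seconds

# Compare each event x with event x + 5, the sixth event of its window.
starts <- seq_len(max(n - k + 1, 0))
hit <- secs[starts + k - 1] - secs[starts] < max_gap

# IDs of the events that open a qualifying window.
events$EventID[starts[hit]]
```

The comparison is vectorised, so after the sort there is no explicit loop over the 59k rows.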

Sorting is O(n log n), but highly optimized. The remainder is O(n) in a single pass over your data. This will beat every single clustering algorithm, because they don't exploit your data characteristics.
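To get from qualifying windows to the output the question asks for (cluster ID, member events, start and finish), one possible sketch follows. The function name `find_clusters`, its arguments, and the rule that windows sharing at least one event are merged into a single cluster are all assumptions; the question does not say how overlapping windows should be treated.

```r
# Hypothetical helper: name, arguments and merge rule are assumptions.
find_clusters <- function(ids, when, k = 6, max_gap = 6 * 3600) {
  ord <- order(when)                    # O(n log n) sort
  ids <- ids[ord]
  when <- when[ord]
  secs <- as.numeric(when)
  n <- length(secs)

  # Single vectorised pass: which windows of k events span less than max_gap?
  starts <- seq_len(max(n - k + 1, 0))
  hit <- starts[secs[starts + k - 1] - secs[starts] < max_gap]
  if (length(hit) == 0) return(NULL)    # no qualifying window

  # Windows starting fewer than k positions apart share an event: merge them.
  group <- cumsum(c(TRUE, diff(hit) >= k))
  first <- as.integer(tapply(hit, group, min))
  last <- as.integer(tapply(hit, group, max)) + k - 1

  data.frame(
    ClusterID = seq_along(first),
    Start = when[first],
    Finish = when[last],
    NEvents = as.integer(last - first + 1),
    EventIDs = mapply(function(a, b) paste(ids[a:b], collapse = ","),
                      first, last)
  )
}

# Toy data: a burst of six events, one straggler, then a burst of seven.
base1 <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
base2 <- as.POSIXct("2000-01-03 12:00:00", tz = "UTC")
when <- c(base1 + 3600 * c(0:5, 9), base2 + 1800 * 0:6)
clusters <- find_clusters(ids = 1:14, when = when)
print(clusters)
```

On this toy data the straggler at 09:00 is left out, because no window of six events containing it spans less than 6 hours. For the real extract, `when` would be built by combining the Date and Time columns with `as.POSIXct` and a format string that matches the file.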
