Reputation: 4478
General Scenario
I have two parameters, Location and Time. I need to group the posts that were uploaded from a particular location at a particular time. There is no preset location or time: a random interesting event can happen at any time, at any location in the plane. A group of users starts uploading as soon as they find something interesting at a particular location at a particular time. The algorithm needs to detect the posts that were uploaded at that time in that place.
Accepted scene
An event happens at point A(1,3) at 4:00. People start taking photos of the event and uploading from around point A; say, some sample upload locations are (1,2), (1.5,3), (1,2.5), around the time 4:00, e.g. at 4:01, 4:10, 4:22, etc.
Unaccepted scene
An event happens at point A(1,3) at 4:00. People upload around the time 4:00, e.g. at 4:01, 4:10, 4:22, but the sample upload locations are far apart, say (9,2), (1.5,15), (21,62.5).
Likewise, an event happens at point A(1,3) at 4:00 and the sample upload locations (1,2), (1.5,3), (1,2.5) are around point A, but the upload times are spread out, like 14:01, 08:10, 13:22, etc.
Understanding so far
From my understanding, if this were based only on location, it could be done using the K-Means clustering algorithm. But since we have the Time dimension as well, I need an algorithm that clusters in 3D. I think DBSCAN might do it, but I'm not sure, as my understanding of it is really vague.
So, which algorithm might help me here? If not a direct answer, I would appreciate some direction to research in, because this is a very vast field and I can't go through every single algorithm.
EDIT
I tried the following test implementation:
from sklearn.cluster import KMeans, MeanShift, DBSCAN
import numpy as np
# Scene one, similar timestamp (turn timestamp into decimal so that the difference is not too large)
# First three are from event 1,
# next 2 are from event 2,
# 3rd one is a random post.
X = np.array([[12.975466, 77.639363, 149794.3292], [12.975273, 77.639358, 149794.3311], [12.975317, 77.639562, 149794.3314],
[12.973567, 77.635589, 149794.3328], [12.973525, 77.635685, 149794.3336], [12.969739, 77.620912, 149794.3349]])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print('K-Means cluster:', kmeans.labels_)
meanshift = MeanShift().fit(X)
print('MeanShift cluster:', meanshift.labels_)
dbscan = DBSCAN(eps=1).fit(X)
print('DBSCAN cluster:', dbscan.labels_)
Output
K-Means cluster: [2 2 2 0 0 1]
MeanShift cluster: [2 0 0 1 1 1]
DBSCAN cluster: [0 0 0 0 0 0]
Here, K-Means clustering works really well at grouping the right points. But, as mentioned in the answers, the disadvantage is that I need to pass the number of clusters as input, which is not possible because there is practically no way of knowing it in my application logic.
I tried MeanShift and DBSCAN as well, but since I don't know what the proper inputs to them should be, that is probably why I am not getting the desired result.
So, how do I get the same result as K-Means clustering without passing the number of output clusters, using the other algorithms?
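For reference, a minimal sketch of what tuning DBSCAN might look like on the test data above (eps=0.003 and min_samples=2 are guesses matched to the scale of these particular coordinates, not recommended values):

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[12.975466, 77.639363, 149794.3292],
              [12.975273, 77.639358, 149794.3311],
              [12.975317, 77.639562, 149794.3314],
              [12.973567, 77.635589, 149794.3328],
              [12.973525, 77.635685, 149794.3336],
              [12.969739, 77.620912, 149794.3349]])

# eps on the order of the within-event spread (~0.002 here), far below the
# ~0.0045 gap between events; min_samples=2 so a pair of nearby posts
# already forms a cluster
dbscan = DBSCAN(eps=0.003, min_samples=2).fit(X)
print('DBSCAN cluster:', dbscan.labels_)
# -> [ 0  0  0  1  1 -1]  (the lone post is labelled -1, i.e. noise)
```

With eps=1, every pairwise distance in this data is below the threshold, which is why all six points ended up in one cluster.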
Upvotes: 0
Views: 371
Reputation: 416
There are two aspects to your question.
The first is how to incorporate time as another dimension. Most clustering algorithms (including k-means) work on multi-dimensional datasets, so you can convert your time to a number and include it as a third dimension in your data. While doing this, you need to consider what units to use and how the units of time relate to the units of space. E.g., if your location data is in kilometers, what should the time equivalent be? Say you arrive at 15 minutes. Then you should scale your time dimension such that 1 unit = 15 minutes. (Some algorithms let you handle this later on as well, but you should think about it nonetheless.)
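A sketch of that conversion (the 15-minutes-per-unit figure is just the example above, and the helper name is made up):

```python
from datetime import datetime

# Assumption from the example: 1 spatial unit (km) is "worth" 15 minutes
MINUTES_PER_UNIT = 15

def to_feature(x_km, y_km, timestamp):
    """Turn a (location, time) pair into a 3-D point with comparable axes."""
    minutes = timestamp.hour * 60 + timestamp.minute
    return (x_km, y_km, minutes / MINUTES_PER_UNIT)

p = to_feature(1.0, 3.0, datetime(2023, 1, 1, 4, 0))
print(p)  # -> (1.0, 3.0, 16.0)
```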
The second is which clustering algorithm would suit this scenario. While k-means is the default algorithm to use, it has the drawback that you need to specify the number of clusters. As the number of data points grows and shrinks from day to day in your system, it is not intuitive to think of a fixed number of clusters, and it is not easy to figure out the relationship between the number of clusters and the number of data points either.
You could try the mean shift algorithm for this use-case. Here, you don't have to specify the number of clusters, and the algorithm will discover this as it goes along. However, you do need to specify the "bandwidth" parameter, which roughly decides whether two points at a certain distance from each other are merged to a single cluster, or stay in their own cluster. You may need some iterations to figure out the right bandwidth as well, but this is likely to stay stable for a given application, unlike the number of clusters.
In general, you will need to try out some runs of the clustering and see what you get, and further tune the parameters.
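On the asker's test data, for instance, a hand-picked bandwidth on the scale of the within-event spread (0.003 here, an illustrative guess, not a recommendation) recovers all three groups without being told how many there are:

```python
from sklearn.cluster import MeanShift
import numpy as np

X = np.array([[12.975466, 77.639363, 149794.3292],
              [12.975273, 77.639358, 149794.3311],
              [12.975317, 77.639562, 149794.3314],
              [12.973567, 77.635589, 149794.3328],
              [12.973525, 77.635685, 149794.3336],
              [12.969739, 77.620912, 149794.3349]])

# bandwidth comparable to the within-cluster distances (~0.002),
# but smaller than the ~0.0045 gap between the two events
meanshift = MeanShift(bandwidth=0.003).fit(X)
print('MeanShift cluster:', meanshift.labels_)
```

With the default (estimated) bandwidth, as in the question, the windows are too wide for this data, which is why the grouping came out wrong.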
Upvotes: 1
Reputation: 153
It is possible to cluster time series of multimedia data using k-means; this paper explains how.
Upvotes: 0
Reputation: 2497
The k-means algorithm is general and works in any number of dimensions. Thus we need to convert the (time, space) data to three-dimensional space.
Say your data is in the format:
data = [
    {"location": (1, 1), "time": "4:00"},
    {"location": (1, 2), "time": "4:01"},
    ...
]
We need to convert the time axis to a space axis:
def get3DCoordinate(point):
    # tau is a hyperparameter that weights the time axis against the space axes
    location, time = point["location"], point["time"]
    return (location[0], location[1], tau * time.convertToDist())

data = list(map(get3DCoordinate, data))
This lets you convert your data to:
data = [
(1, 1, 960),
(1, 2, 961),
...
]
These points can be used directly by k-means.
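A runnable version of that sketch might look like the following (tau = 4 and the `time_to_dist` helper are illustrative assumptions, chosen so that "4:00" maps to 960 as in the example above):

```python
from sklearn.cluster import KMeans
import numpy as np

tau = 4.0  # hypothetical weight for the time axis

def time_to_dist(hhmm):
    """Convert an "H:MM" string into minutes since midnight."""
    h, m = map(int, hhmm.split(":"))
    return h * 60 + m

data = [
    {"location": (1, 1), "time": "4:00"},
    {"location": (1, 2), "time": "4:01"},
    {"location": (9, 9), "time": "8:30"},
]

points = [(d["location"][0], d["location"][1], tau * time_to_dist(d["time"]))
          for d in data]
print(points[0])  # -> (1, 1, 960.0)

# the converted 3-D points feed straight into k-means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.array(points))
print(kmeans.labels_)
```

The first two posts (close in both space and time) land in one cluster, while the third lands in its own.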
Upvotes: 1