Reputation: 2790
I'm new to scikit-learn and I'm trying to cluster people by their interest in movies. I create a sparse matrix with one column per movie and one row per user. Each cell is 1 if the user liked the movie and 0 otherwise.
sparse_matrix = numpy.zeros(shape=(len(list_user), len(list_movie)))
for id in list_user:
    index_id = list_user.index(id)
    for movie in list_movie[index_id]:
        if movie.isdigit():
            index_movie = list_movie.index(int(movie))
            sparse_matrix[index_id][index_movie] = 1
# pickle files should be opened in binary mode
pickle.dump(sparse_matrix, open("data/sparse_matrix", "wb"))
return sparse_matrix
I treat this as an array of vectors, which according to the documentation is an acceptable input:
Perform DBSCAN clustering from vector array or distance matrix.
So I tried the following with scikit-learn:
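import pickle
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

sparse_matrix = pickle.load(open("data/sparse_matrix", "rb"))
X = StandardScaler().fit_transform(sparse_matrix)
db = DBSCAN(eps=1, min_samples=20).fit(X)
# Boolean mask marking which samples are core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)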
I based this on the DBSCAN example from scikit-learn. I have two questions. The first one is: "Is my matrix well formatted and suitable for this algorithm?" I am concerned about the number of dimensions. The second question is: "How do I set the epsilon parameter (the maximum distance between two points for them to count as neighbours)?"
Upvotes: 0
Views: 1420
Reputation: 77454
See the DBSCAN article for a suggestion on how to choose epsilon based on the k-distance graph.
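As a rough illustration of the k-distance idea, here is a minimal sketch that computes, for every point, the distance to its k-th nearest neighbour and sorts those values. The helper name `k_distances`, the toy data `X_demo`, and the choice `k=4` are assumptions made for the example, not something taken from the question. The usual heuristic is to plot the sorted values and pick epsilon near the "knee" of the curve.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k, metric="euclidean"):
    """Distance from each point to its k-th nearest neighbour, sorted descending."""
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # Calling kneighbors() without arguments queries the training points
    # themselves and does not count a point as its own neighbour.
    distances, _ = nn.kneighbors()
    return np.sort(distances[:, -1])[::-1]


# Hypothetical toy data: 200 users, 30 movies, roughly 10% likes.
rng = np.random.RandomState(0)
X_demo = (rng.rand(200, 30) < 0.1).astype(float)

kd = k_distances(X_demo, k=4)
# Plot kd (for example with matplotlib) and look for the point where the
# curve bends sharply; the distance at that point is a candidate for eps.
```

The value of `k` is normally tied to the `min_samples` you intend to pass to DBSCAN, so both should be tuned together.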
Since your data is sparse, it is probably more appropriate to use e.g. cosine distance rather than Euclidean distance. You should also use a sparse format. As far as I know, numpy.zeros will create a dense matrix, so
sparse_matrix = numpy.zeros(...)
is misleading: it is a dense matrix, just with mostly 0s.
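To make this concrete, here is a small sketch that stores the likes in an actual `scipy.sparse` matrix and runs DBSCAN with cosine distance. The `likes` dictionary and the values `eps=0.3` and `min_samples=2` are made up for the example and would need tuning on real data. Note that no `StandardScaler` step is used here, since centering would destroy the sparsity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical input: for each user, the ids of the movies they liked.
likes = {
    "alice": [10, 20, 30],
    "bob": [10, 20],
    "carol": [10, 20, 30],
    "dave": [40, 50],
    "erin": [40, 50, 60],
    "frank": [70],
}

users = sorted(likes)
movies = sorted({m for liked in likes.values() for m in liked})
movie_col = {m: j for j, m in enumerate(movies)}  # movie id -> column index

# Collect the coordinates of the non-zero cells only.
rows, cols = [], []
for i, user in enumerate(users):
    for m in likes[user]:
        rows.append(i)
        cols.append(movie_col[m])

# Only the 1 entries are stored; the zeros are implicit.
X = csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(users), len(movies)),
)

# Brute-force neighbour search supports cosine distance on sparse input.
db = DBSCAN(eps=0.3, min_samples=2, metric="cosine", algorithm="brute").fit(X)
labels = db.labels_  # -1 marks points that DBSCAN treats as noise
```

With cosine distance, two users who liked mostly the same movies end up close together regardless of how many movies each of them rated, which tends to suit binary like/no-like data better than Euclidean distance.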
Upvotes: 2