matt525252

Reputation: 742

Tag-based co-occurrence image clustering

I labeled lots of object images using the Google Vision API. Using those labels (list in a pickle here), I created a label co-occurrence matrix (download as a numpy array here). The size of the matrix is 2195x2195.

Loading the data:

import pickle
import numpy as np
with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)

cooccurrence = np.load('cooccurrence.npy')

I would like to use a clustering analysis to find a reasonable number of clusters (each defined as a list of Vision labels) which would represent some objects (e.g. cars, shoes, books, ...). I do not know what the right number of clusters is.

I tried the hierarchical clustering algorithm available in scikit-learn:

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 1000)

# creating a non-symmetrical "similarity" matrix by normalizing each row by the label's own occurrence count:
occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:,None]

#clustering:
from sklearn.cluster import AgglomerativeClustering
clusters = AgglomerativeClustering(n_clusters=200, affinity='euclidean', linkage='ward').fit_predict(similarities)

#results in pandas:
df_clusters = pd.DataFrame({'cluster': clusters.tolist(), 'label': labels})
df_clusters_grouped = df_clusters.groupby(['cluster']).agg({'label': [len, list]})
df_clusters_grouped.columns = [' '.join(col).strip() for col in df_clusters_grouped.columns.values]
df_clusters_grouped.rename(columns = {'label len': 'cluster_size', 'label list': 'cluster_labels'}, inplace=True)
df_clusters_grouped.sort_values(by=['cluster_size'], ascending=False)

This way, I was able to create 200 clusters, one of which can look like this:

["Racket", "Racquet sport", "Tennis racket", "Rackets", "Tennis", "Racketlon", "Tennis racket accessory", "Strings"]

This somewhat works, but I would rather use some soft clustering method that can assign one label to multiple clusters (for instance, "leather" might make sense for both shoes and wallets). Also, I had to define the number of clusters (200 in my example code), which is something I would rather get as a result (if possible).
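If getting the number of clusters as a result is the main concern, one option is to let AgglomerativeClustering cut the dendrogram at a distance threshold instead of fixing n_clusters. This is only a sketch reusing the similarities matrix from above; the threshold of 5.0 is an arbitrary assumption that would need tuning for this data:

from sklearn.cluster import AgglomerativeClustering

# n_clusters=None together with distance_threshold lets the dendrogram cut
# determine the number of clusters; 5.0 is a made-up value to tune.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                                affinity='euclidean', linkage='ward').fit(similarities)
print(model.n_clusters_)  # number of clusters found at this threshold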

I also played with hdbscan, k-clique and Gaussian mixture models, but I did not come up with any better output.
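For the soft-assignment part, a Gaussian mixture model can at least produce membership probabilities via predict_proba. Below is only a sketch on the same similarities matrix; the 200 components and the 0.1 probability cut-off are assumptions, not recommendations:

import numpy as np
from sklearn.mixture import GaussianMixture

# Each row of the similarity matrix is treated as the feature vector of one label.
gmm = GaussianMixture(n_components=200, covariance_type='diag', random_state=0)
gmm.fit(similarities)

# Soft assignments: probability of each label belonging to each cluster.
membership = gmm.predict_proba(similarities)  # shape (2195, 200)

# A label can then be listed in every cluster where its membership
# probability exceeds a cut-off (0.1 here is arbitrary).
soft_assignment = [np.where(row > 0.1)[0].tolist() for row in membership]
df_soft = pd.DataFrame({'label': labels, 'clusters': soft_assignment})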

Upvotes: 1

Views: 457

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77485

Clustering methods such as AgglomerativeClustering of sklearn require a data matrix as input. With metric="precomputed" you can also use a distance matrix (but not for k-means and Gaussian mixture modeling; these do need coordinate data).

You, however, have a co-occurrence or similarity matrix. These values have the opposite meaning, so you'll have to identify an appropriate transformation (for example occurrences - cooccurrences). Treating the co-occurrence matrix as a data matrix (and then using Euclidean distance - that is what you do) works to some extent, but has very weird semantics and is not recommended.
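A minimal sketch of that idea, assuming the cooccurrence matrix from the question: turn the row-normalized similarities into a symmetrized distance and pass it to AgglomerativeClustering as a precomputed matrix. The exact transformation (1 - similarity below, or occurrences - cooccurrences as mentioned above) is a modelling choice, and ward linkage is not available for precomputed distances:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:, None]

# Convert similarity to distance and symmetrize it, since the row-normalized
# matrix is not symmetric; the diagonal (self-distance) is set to zero.
distance = 1.0 - (similarities + similarities.T) / 2.0
np.fill_diagonal(distance, 0.0)

# With a precomputed matrix, use 'average' (or 'complete'/'single') linkage;
# depending on the scikit-learn version the parameter is affinity or metric.
clusters = AgglomerativeClustering(n_clusters=200, affinity='precomputed',
                                   linkage='average').fit_predict(distance)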

Upvotes: 1
