Pieter-Jan Hoet

Reputation: 11

How to cluster similar lists together?

I’m looking for an algorithm that can group lists of strings that have almost the same content.

Here is an example set of lists. In total there are 5 different words.

A = ['first', 'second', 'third']
B = ['first', 'fourth']
C = ['second', 'third']
D = ['first', 'third']
E = ['first', 'fifth']
F = ['fourth', 'fifth']

You can see that A, C and D have a lot in common, and so do B, E and F.

I thought about a clustering algorithm that assigns nearly identical lists to the same cluster.

I want two clusters, making sure every word appears in at least one cluster.

In this example, lists A, C and D should be in cluster 1 and lists B, E and F in cluster 2.

Is there an algorithm (or machine learning method) in Python that can be used for this type of problem?

Upvotes: 0

Views: 1432

Answers (1)

yatu

Reputation: 88295

This looks like a good use case for a Latent Dirichlet allocation (LDA) model.

LDA is an unsupervised model that finds groups of similar observations in a data set, which you can then use to assign a topic to each observation.

Here's how you could go about this:

from sklearn.feature_extraction.text import CountVectorizer
import lda

Fit a CountVectorizer to obtain a matrix of token counts from the list of strings:

l = [' '.join(i) for i in [A,B,C,D,E,F]]
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))

X = vec.fit_transform(l)
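
To get a feel for what that matrix of token counts holds, here is a minimal plain-Python sketch that builds the same kind of table by hand. It is only an illustration, not what CountVectorizer does internally; the names lists, vocab and matrix are made up for this example, and B's second word is spelled 'fourth'.

```python
from collections import Counter

# The six lists from the question (B's second word spelled 'fourth').
lists = {
    'A': ['first', 'second', 'third'],
    'B': ['first', 'fourth'],
    'C': ['second', 'third'],
    'D': ['first', 'third'],
    'E': ['first', 'fifth'],
    'F': ['fourth', 'fifth'],
}

# One column per distinct word, in alphabetical order.
vocab = sorted({word for words in lists.values() for word in words})

# One row per list: how many times each vocabulary word occurs in it.
matrix = {
    name: [Counter(words)[word] for word in vocab]
    for name, words in lists.items()
}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Each row is the bag-of-words representation of one list, and rows that share many nonzero columns are the ones a topic model would be expected to place together.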

Then fit a model from the lda package on the output of the CountVectorizer (other libraries, such as gensim, also provide an LDA implementation):

model = lda.LDA(n_topics=2, random_state=1)
model.fit(X)
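
If the lda package is not available, roughly the same step can be sketched with scikit-learn's own LatentDirichletAllocation. This is a swapped-in implementation, not the one used in this answer, so its topic assignments may differ from those shown here; the names docs, sk_model and sk_doc_topic and the random_state value are arbitrary choices for this sketch.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# The six lists from the question, each joined into one string
# (B's second word spelled 'fourth').
docs = [
    'first second third',  # A
    'first fourth',        # B
    'second third',        # C
    'first third',         # D
    'first fifth',         # E
    'fourth fifth',        # F
]

X = CountVectorizer().fit_transform(docs)

# n_components is the number of topics; random_state=0 is an arbitrary seed.
sk_model = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row holds one list's probability of belonging to each topic.
sk_doc_topic = sk_model.fit_transform(X)

# Most probable topic per list, in the order A..F.
print(sk_doc_topic.argmax(axis=1))
```

With only six very short documents the result can change with the seed, so it is worth trying a few values of random_state before trusting a grouping.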

Finally, assign each list to the more probable of the 2 topics:

doc_topic = model.doc_topic_

for i in range(len(l)):
    print(f'Cluster {i}: Topic ', doc_topic[i].argmax())

Cluster 0: Topic  1 # -> A
Cluster 1: Topic  0
Cluster 2: Topic  1 # -> C
Cluster 3: Topic  1 # -> D
Cluster 4: Topic  0
Cluster 5: Topic  0
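
To turn those per-list labels into the two groups asked for, something like the following sketch could be used. The topic labels are copied from the printed output above; the names names, topics and groups are made up for this example.

```python
from collections import defaultdict

names = ['A', 'B', 'C', 'D', 'E', 'F']

# Topic label per list, copied from the printed output above.
topics = [1, 0, 1, 1, 0, 0]

# Collect the list names that share a topic label.
groups = defaultdict(list)
for name, topic in zip(names, topics):
    groups[topic].append(name)

for topic in sorted(groups):
    print(f'Topic {topic}:', groups[topic])
```

The topic numbers themselves are arbitrary; what matters is which lists end up sharing a label.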

Upvotes: 2
