Reputation: 11
I’m looking for an algorithm that can group lists of strings with almost the same content.
Here is an example of the lists; in total there are 5 different words.
A = ['first', 'second', 'third']
B = ['first', 'fourth']
C = ['second', 'third']
D = ['first', 'third']
E = ['first', 'fifth']
F = ['fourth', 'fifth']
You can see that A, C and D have a lot in common, as do B, E and F.
I thought about a clustering algorithm that can assign similar lists to the same cluster.
I want two clusters, with every list assigned to exactly one of them.
In this example, lists A, C and D should get cluster 1,
and B, E and F cluster 2.
Is there an algorithm (or machine learning model) in Python that can be used for this type of problem?
Upvotes: 0
Views: 1432
Reputation: 88295
This looks like a good use case for a Latent Dirichlet Allocation (LDA) model.
LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them.
Here's how you could go about this:
from sklearn.feature_extraction.text import CountVectorizer
import lda
Fit a CountVectorizer to obtain a matrix of token counts from the list of strings:
l = [' '.join(i) for i in [A,B,C,D,E,F]]
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(l)
Use lda and fit a model on the result from the CountVectorizer (there are also other modules with an LDA implementation, such as gensim):
model = lda.LDA(n_topics=2, random_state=1)
model.fit(X)
And assign a group number to each document based on the 2 created topics:
doc_topic = model.doc_topic_
for i in range(len(l)):
    print(f'Cluster {i}: Topic ', doc_topic[i].argmax())
Cluster 0: Topic 1  # -> A
Cluster 1: Topic 0  # -> B
Cluster 2: Topic 1  # -> C
Cluster 3: Topic 1  # -> D
Cluster 4: Topic 0  # -> E
Cluster 5: Topic 0  # -> F
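As an aside, since the lda package is an extra dependency, here's a self-contained sketch of the same approach using scikit-learn's own LatentDirichletAllocation (note this uses a different inference method, so the topic numbering and exact groupings may differ from the output above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

A = ['first', 'second', 'third']
B = ['first', 'fourth']
C = ['second', 'third']
D = ['first', 'third']
E = ['first', 'fifth']
F = ['fourth', 'fifth']

# Join each list into a single space-separated "document"
docs = [' '.join(i) for i in [A, B, C, D, E, F]]
X = CountVectorizer(analyzer='word', ngram_range=(1, 1)).fit_transform(docs)

# Fit LDA with 2 topics; fit_transform returns the per-document
# topic distribution (each row sums to 1)
model = LatentDirichletAllocation(n_components=2, random_state=1)
doc_topic = model.fit_transform(X)

# Assign each list to its most probable topic
for name, row in zip('ABCDEF', doc_topic):
    print(f'List {name}: Topic {row.argmax()}')
```

The argmax over each row plays the same role as doc_topic_ in the lda package's API.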
Upvotes: 2