Reputation: 1019
I am new to both Python and scikit-learn. I want to cluster a bunch of text files (the bodies of news articles), and I am using the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import nltk, sklearn, string, os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessing text with the NLTK package
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

###########################################################################
# Loading and preprocessing the text data
print("\n Loading text dataset:")
path = 'n'
for subdir, dirs, files in os.walk(path):
    for i, f in enumerate(files):
        if f != '.DS_Store':
            file_path = subdir + os.path.sep + f
            with open(file_path, 'r') as shakes:
                text = shakes.read()
            lowers = text.lower()
            # str.translate with a plain string does not remove punctuation;
            # build a deletion table instead (Python 3). On Python 2 use
            # lowers.translate(None, string.punctuation).
            no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))
            token_dict[f] = no_punctuation

###########################################################################
true_k = 3  # *
print("\n Performing stemming and tokenization...")
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                             stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
print("n_samples: %d, n_features: %d" % X.shape)
print()

###############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
print(km)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
This code prints the top words per cluster. But how can I find out which of the original text files belongs to cluster 0, cluster 1, or cluster 2?
Upvotes: 1
Views: 1911
Reputation: 302
To explain a bit more: you can store the cluster assignments using the following:
clusters = km.labels_.tolist()
This list is in the same order as the documents you passed to the vectorizer. (Note that plain dicts do not guarantee iteration order on older Python versions, so it is safest to freeze the filenames into a list and feed the texts to the vectorizer in that order.)
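As a minimal sketch of the idea (using hypothetical toy documents in place of your news files, and freezing the key order up front so filenames and labels line up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for the news files (hypothetical names and contents).
token_dict = {
    'doc_a.txt': 'stocks market trading shares market',
    'doc_b.txt': 'football match goal team player',
    'doc_c.txt': 'shares stocks investors market fund',
    'doc_d.txt': 'team coach season football league',
}

filenames = list(token_dict.keys())  # freeze the document order
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(token_dict[f] for f in filenames)

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
km.fit(X)

clusters = km.labels_.tolist()  # one label per document, same order as filenames
for fname, label in zip(filenames, clusters):
    print('%s -> cluster %d' % (fname, label))
```

With `zip(filenames, clusters)` you can then build whatever mapping you need, e.g. a dict from filename to cluster id.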
I just put together a guide to document clustering you might find helpful. Let me know if I can explain anything in more detail: http://brandonrose.org/clustering
Upvotes: 2