user3789843

Reputation: 1019

document clustering in python

I am new to both Python and scikit-learn. I want to cluster a set of text files (news article bodies), and I am using the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import nltk, sklearn, string, os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessing text with NLTK package
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
###########################################################################
# Loading and preprocessing text data
print("\n Loading text dataset:")
path = 'n'

for subdir, dirs, files in os.walk(path):
    for f in files:
        if f != '.DS_Store':
            file_path = subdir + os.path.sep + f
            with open(file_path, 'r') as shakes:
                text = shakes.read()
            lowers = text.lower()
            # str.translate takes a translation table, not the punctuation
            # string itself; in Python 2 pass (None, string.punctuation),
            # in Python 3 use str.maketrans('', '', string.punctuation)
            no_punctuation = lowers.translate(None, string.punctuation)
            token_dict[f] = no_punctuation
###########################################################################
true_k = 3  # number of clusters
print("\n Performing stemming and tokenization...")
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                              stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
print("n_samples: %d, n_features: %d" % X.shape)
print()
###############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
print(km)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

This code gets the top terms per cluster. But how can I tell which of the original text files belongs to cluster 0, cluster 1, or cluster 2?

Upvotes: 1

Views: 1911

Answers (1)

brandomr

Reputation: 302

To explain a bit more: you can store the cluster assignments using the following:

clusters = km.labels_.tolist()

This list is in the same order as the documents you passed to the vectorizer's fit_transform, so you can line it up with the keys of your token_dict.
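
For example, a minimal sketch that maps each filename to its cluster, assuming the token_dict from your question is left unmodified so that its keys() and values() stay aligned:

# km.labels_ is ordered like the documents passed to fit_transform,
# i.e. token_dict.values(), so it pairs with token_dict.keys()
clusters = km.labels_.tolist()
file_clusters = dict(zip(token_dict.keys(), clusters))

for filename, cluster in sorted(file_clusters.items()):
    print("%s -> cluster %d" % (filename, cluster))

# e.g. all files assigned to cluster 0
cluster0_files = [fn for fn, c in file_clusters.items() if c == 0]
print(cluster0_files)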

I just put together a guide to document clustering you might find helpful. Let me know if I can explain anything in more detail: http://brandonrose.org/clustering

Upvotes: 2
