user1717931

Reputation: 2501

cluster using feature hashing

I have to cluster some documents that are in json-format. I would like to tinker with feature-hashing to reduce the dimensions. To begin small, here is my input:

doc_a = { "category": "election, law, politics, civil, government",
          "expertise": "political science, civics, republican"
        }

doc_b = { "category": "Computers, optimization",
          "expertise": "computer science, graphs, optimization"
        }
doc_c = { "category": "Election, voting",
          "expertise": "political science, republican"
        }
doc_d = { "category": "Engineering, Software, computers",
          "expertise": "computers, programming, optimization"
        }
doc_e = { "category": "International trade, politics",
          "expertise": "civics, political activist"
        }

Now, how do I go about using feature hashing to create vectors for each document, and then compute similarity and create clusters? I am a bit lost after reading http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html. I am not sure whether I should pass my data as dicts, or convert it to have some ints and then use 'pair' as the input_type for my FeatureHasher. Also, how should I interpret the output of FeatureHasher? For example, the example in that documentation outputs a numpy array.

In [1]: from sklearn.feature_extraction import FeatureHasher

In [2]: hasher = FeatureHasher(n_features=10, non_negative=True, input_type='pair')

In [3]: x_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])

In [4]: x_new.toarray()
Out[4]:
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])


I think the rows are documents, but what do the column values represent? And if I want to cluster these vectors or find similarities between them (using cosine or Jaccard), do I have to do an item-wise comparison?

Expected output: doc_a, doc_c and doc_e should be in one cluster, and the rest in another cluster.

Thanks!

Upvotes: 1

Views: 885

Answers (1)

Ryan Walker

Reputation: 3286

You'll make things easier on yourself if you use the HashingVectorizer instead of the FeatureHasher for this problem. The HashingVectorizer takes care of tokenizing your input data and can accept a list of strings.

The main challenge with this problem is that you actually have two kinds of text features, category and expertise. The trick in that case is to fit a hashing vectorizer for each field separately and then combine the outputs:

from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import hstack
from sklearn.cluster import KMeans

docs = [doc_a, doc_b, doc_c, doc_d, doc_e]

# vectorize both fields separately
category_vectorizer = HashingVectorizer()
Xc = category_vectorizer.fit_transform([doc["category"] for doc in docs])

expertise_vectorizer = HashingVectorizer()
Xe = expertise_vectorizer.fit_transform([doc["expertise"] for doc in docs])

# combine the features into a single data set
X = hstack((Xc,Xe))
print("X: %d x %d" % X.shape)
print("Xc: %d x %d" % Xc.shape)
print("Xe: %d x %d" % Xe.shape)

# cluster the documents into two groups with k-means
km = KMeans(n_clusters=2)
labels = km.fit_predict(X)

# report which cluster each document ends up in
for name, label in zip(["a", "b", "c", "d", "e"], labels):
    print("doc_%s is in cluster %d" % (name, label))

Upvotes: 1
