Details on Clustering and Classification

Question

I am working on a homework assignment that involves Clustering and Classification and need some help as I am stuck.

I have a file with around 10000 lines each with a random sentence such as

he likes computer science jobs

he has worked in the medical industry before

she likes to play with kids

he has had 5 years experience in computer science field.

I need to build multiple clusters out of all the input sentences and then put each sentence into a cluster.

For Example:

COMPUTER SCIENCE: he likes computer science jobs
COMPUTER SCIENCE: he has had 5 years experience in computer science field.
KIDS: she likes to play with kids
MEDICAL: he has worked in the medical industry before

Now the clusters don't need to be called Computer Science, Kids, Medical etc.; they will just have number assignments.

What I Have Done:

Read the file and cleaned each line by REMOVING STOP WORDS, LOWERCASING THE ENTIRE SENTENCE, REMOVING PUNCTUATION AND OTHER NON-ALPHANUMERIC CHARACTERS, and STEMMING THE WORDS USING PORTER.

Currently I have two things:

a DICT in the format of ID(0-10000): CLEAN SENTENCE
a DICT in the format of WORD: COUNT for each clean word in all 10000 sentences that is unique after being stemmed and cleaned from the string.

What would be my next step? Is this when I implement KNN or KMeans etc?

Abhimanu Kumar · Accepted Answer

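Your next step should be to cluster the cleaned text above, where each cleaned sentence is a data point. k-means works on numeric vectors, so first turn each sentence into a vector, for example of word counts over your vocabulary. You can then use k-means from any of the Python data mining libraries to get the clusters.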
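To get a feel for what a library does for you here, below is a minimal sketch in plain Python. It makes several assumptions that are not in your question: the four cleaned sentences are made up (they only imitate Porter-stemmed output), each sentence becomes a length-normalised word-count vector, and the starting centroids are picked with a simple farthest-point rule so the run is repeatable. The names `vectorize`, `sq_dist` and `kmeans` are hypothetical helpers, not functions from any library. For 10000 sentences you would want a real library implementation instead.

```python
import math
from collections import Counter


def vectorize(sentences):
    """Turn cleaned sentences into unit-length word-count vectors."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for s in sentences:
        v = [0.0] * len(vocab)
        for word, count in Counter(s.split()).items():
            v[index[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain Lloyd-style k-means with a deterministic farthest-point start."""
    # Assumption: start from the first point, then repeatedly add the point
    # that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(far)

    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical cleaned sentences, in the spirit of the examples in the question.
cleaned = [
    "like comput scienc job",
    "work medic industri",
    "like play kid",
    "year experi comput scienc field",
]
vocab, vectors = vectorize(cleaned)
labels, centroids = kmeans(vectors, k=3)
for sentence, label in zip(cleaned, labels):
    print(label, sentence)
```

On this toy input the two computer science sentences land in the same numbered cluster and the other two each get their own.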

======== clustering=========

Now how do you decide the K in k-means (i.e. the number of clusters)? 1) By plotting the objective curve of k-means against K and then picking the K that corresponds to the knee, or 2) by using the Bayesian information criterion, or 3) by some other popular method that suits your particular dataset. If you don't know about these, then read up here: How do I determine k when using k-means clustering?
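If you would rather not eyeball the plot for option (1), one rough way to locate the knee is to look for the K where the curve bends the most. The sketch below is only one heuristic among several: it assumes evenly spaced K values and uses the largest second difference of the objective. The function name `pick_knee` is hypothetical, and the objective values are made-up numbers purely for illustration, not results from any dataset.

```python
def pick_knee(ks, objective):
    """Return the K where the objective curve bends most sharply.

    ks        -- evenly spaced candidate values of K, in increasing order
    objective -- the k-means objective (within-cluster sum of squares) per K
    """
    if len(ks) != len(objective) or len(ks) < 3:
        raise ValueError("need at least three (K, objective) pairs")
    best_k, best_bend = None, float("-inf")
    # The end points have no neighbour on one side, so they cannot be a knee.
    for i in range(1, len(ks) - 1):
        # Discrete second difference: large when the drop flattens out here.
        bend = objective[i - 1] - 2 * objective[i] + objective[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Illustrative, made-up objective values for K = 1..6.
ks = [1, 2, 3, 4, 5, 6]
objective = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
knee = pick_knee(ks, objective)
print(knee)
```

Treat the result as a suggestion and still look at the plot, since real objective curves are often much smoother than this toy one.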

Since this is homework, I'd say the learning experience counts more, so you should try more than one of the above to get a "feel" for it.

At the end of this procedure you will have K clusters.

Now comes the classification part.

======== classification=========

Treat each of the K clusters as one class. There are many ways you can go about classifying each datapoint (i.e. cleaned sentence) into the K classes:

1. Whatever cluster each datapoint was assigned to at the end of k-means, you can treat the datapoint as having that class.
2. Take each cluster centroid as the representative point for its class and use some similarity or distance measure, such as cosine similarity or KL divergence, to compare a given datapoint with each of the K representative class-points. Assign the datapoint to its closest class-point and hence to that class.

Note that (1) above is the easiest.
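For option (2), here is a minimal sketch of assigning a sentence to its nearest centroid by cosine similarity. Everything in it is assumed for illustration: vectors are stored as word-to-weight dicts, the three centroids and their weights are invented rather than produced by an actual k-means run, and `cosine` and `classify` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(vector, centroids):
    """Return the id of the centroid most similar to the given vector."""
    return max(centroids, key=lambda cid: cosine(vector, centroids[cid]))


# Invented centroids keyed by cluster number (not real k-means output).
centroids = {
    0: {"comput": 1.0, "scienc": 1.0, "job": 0.5, "experi": 0.5},
    1: {"medic": 1.0, "industri": 1.0, "work": 1.0},
    2: {"kid": 1.0, "play": 1.0, "like": 1.0},
}

# A hypothetical cleaned sentence: "comput scienc degre".
new_sentence = {"comput": 1.0, "scienc": 1.0, "degre": 1.0}
predicted = classify(new_sentence, centroids)
print(predicted)
```

The same function also works for sentences that were not part of the clustering run, which is the main reason to prefer (2) over (1).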

========================================

There are various other methods for clustering (spherical k-means, agglomerative clustering, etc.), and choosing one of them will change your classification step as well.
