Angel

Reputation: 13

K-means: grouping files into specific clusters in MapReduce

I am new to the Hadoop platform.

I have gone through the k-means implementation here, but it groups points. I need to group different files, so that the user can see which file falls in which group.

Is there a way to do this? I searched a lot but could not find one. What changes should I make to that code?

**UPDATE2**

I want to input 100 files.

Which is better: putting all 100 files in one directory and using that as the input, or passing them individually (file1, file2, and so on)? And how can this be handled?

Upvotes: 0

Views: 708

Answers (1)

Karl.Li

Reputation: 167

What is K-means?

K-means is one of the simplest clustering algorithms.

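First, k-means is clustering, not classification. Clustering groups items by similarity without predefined labels, while classification assigns items to labels that are known in advance.

As a loose illustration: suppose you are given a person's name and do not know whether the person is a man or a woman. A friend of yours who is a man has the same name, so you guess that this person is a man too. That is clustering (the person may actually be a woman; we are not sure, we just prefer the most likely answer). Now suppose you are given a man and you know for certain that he lives near you, so you can say he is your neighbour. That is classification.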

To cluster files into different groups, you first need a model (k-means builds a model that, given a point, assigns it to the nearest center point) and then you take the most likely answer.
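The "assign a point to the nearest center" idea can be sketched in a few lines. This is only an illustration: the function name `nearest_center`, the use of Euclidean distance, and the sample centers and points are assumptions, not part of the implementation linked in the question.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

# Hypothetical 2-D centers and points, purely for illustration.
centers = [(1.0, 0.0), (0.0, 1.0)]
print(nearest_center((0.9, 0.1), centers))  # index of the closest center
print(nearest_center((0.2, 0.8), centers))
```

Once every file is represented as a point, this is the whole "model": a file belongs to whichever center it is closest to.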

For example, you could cluster them by file name: a file called apple may fall in a fruit group, and a file called mouse may fall in an animal group. (All of this depends on your point thesaurus.)


Now I will show you an example of how to vectorize files; it is about the simplest one possible.

Assume you have a point thesaurus, maybe like this:

Bill Gates  : IT
iphone  :  IT
basketball :  sports
Michael Jordan : sports

And four files:

file1:  I love iphone very much
file2:  I like play basketball
file3: Bill Gates is the richest man.
file4: He is the fans of Michael Jordan.

We extract the keywords (those recorded in the point thesaurus) from each file, and then calculate the percentage of each category per file. This gives the result:

file1: 100% IT, 0% sports

file2: 0% IT, 100% sports

file3: 100% IT, 0% sports

file4: 0% IT, 100% sports

Then we get two groups, IT and sports. (Usually a file contains lots of words, so exact 100% and 0% values will not occur in real data; don't mind the details.)
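A minimal sketch of this vectorization step, assuming simple case-insensitive substring matching against the thesaurus. The names `THESAURUS`, `CATEGORIES`, `FILES` and `vectorize` are made up for illustration, and the file contents are inlined as strings instead of being read from HDFS.

```python
# Thesaurus and sample files from the example above (keywords lowercased).
THESAURUS = {
    "bill gates": "IT",
    "iphone": "IT",
    "basketball": "sports",
    "michael jordan": "sports",
}
CATEGORIES = ["IT", "sports"]  # axis order of the resulting vector

FILES = {
    "file1": "I love iphone very much",
    "file2": "I like play basketball",
    "file3": "Bill Gates is the richest man.",
    "file4": "He is the fans of Michael Jordan.",
}

def vectorize(text):
    """Return the share of thesaurus hits per category, in CATEGORIES order."""
    lowered = text.lower()
    counts = {category: 0 for category in CATEGORIES}
    for keyword, category in THESAURUS.items():
        counts[category] += lowered.count(keyword)
    total = sum(counts.values())
    if total == 0:
        # No known keyword in the file: fall back to the zero vector.
        return [0.0] * len(CATEGORIES)
    return [counts[category] / total for category in CATEGORIES]

vectors = {name: vectorize(text) for name, text in FILES.items()}
for name, vector in vectors.items():
    print(name, vector)
```

Each file name stays attached to its vector, which is what later lets you report which file ended up in which group.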


Think about what this example tells us.

  1. A model is something (in this example it is just a number; most of the time it is a formula) that captures the connection between data (in this example, the connection between files).
  2. What is vectorizing? Look carefully and you will see the axes (x is IT, y is sports).
  3. Why do we emphasize vectorizing? Because once you have the axes, you can plug the vectors into k-means, and it works.
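To make the last point concrete, here is a small single-machine sketch of k-means run on the four file vectors from the example. It is not the MapReduce code from the question: the `kmeans` function, the fixed iteration count, and the choice of the first k points as initial centers are simplifying assumptions.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points serve as initial centers."""
    centers = [list(point) for point in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda i: math.dist(point, centers[i]))
            for point in points
        ]
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centers

# (IT share, sports share) per file, taken from the example above.
file_vectors = {
    "file1": [1.0, 0.0],
    "file2": [0.0, 1.0],
    "file3": [1.0, 0.0],
    "file4": [0.0, 1.0],
}
names = list(file_vectors)
labels, centers = kmeans([file_vectors[name] for name in names], k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

In a MapReduce version, the assignment step would typically live in the mapper and the update step in the reducer. Carrying the file name along with its vector is what lets the user see which file falls in which group.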

Upvotes: 1
