Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills): To implement a simple algorithm to cluster people according to their weight and height. The data set includes a list of people with their weights and heights like so: Person Weight Height (kg) (inches) Person 1 70 70 Person 2 75 80 Person 3 120 85 You can plot the data as a 2 dimensional data. Weight being one dimension and height being the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range from a minimum of 38inches to 90inches Algorithm: The algorithm (called K-means clustering) will cluster data into K groups goes as such: Start with K clusters. Each cluster is defined by its center point which will start of as random weight and random height. Pick random numbers from within the corresponding ranges defined above. For each person Calculate distance to center of each cluster using formula distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2)) where wperson = weight of person, hperson = height of person wcenter = weight of cluster center point, hcenter = height of cluster center point Assign the person to the cluster with the shortest distance to center point of cluster After end of step 2, you will end up with K clusters each assigned with a set of people For each cluster, set the weight and height of the center point to the average of the people in the cluster wcenter = (sum of weight of each person in cluster)/(number of people in cluster) hcenter = (sum of height of each person in cluster)/number of people in cluster) Repeat steps 2 to 5 for 1000 iterations, then print out following information for each cluster. weight and height of center of cluster. list of people in cluster. I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc. I dont want to give my solution now, but will post it later in the day?

Reputation: 27

Design Problem

Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills):

To implement a simple algorithm to cluster people according to their weight and height. The data set includes a list of people with their weights and heights like so:

Person   Weight   Height
         (kg)     (inches)
Person 1 70        70

Person 2 75        80

Person 3 120       85

You can plot the data as a 2 dimensional data. Weight being one dimension and height being the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range from a minimum of 38inches to 90inches

Algorithm:

The algorithm (called K-means clustering) will cluster data into K groups goes as such:

Start with K clusters. Each cluster is defined by its center point which will start of as random weight and random height. Pick random numbers from within the corresponding ranges defined above.
For each person Calculate distance to center of each cluster using formula distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2)) where wperson = weight of person, hperson = height of person wcenter = weight of cluster center point, hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the people in the cluster wcenter = (sum of weight of each person in cluster)/(number of people in cluster) hcenter = (sum of height of each person in cluster)/number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out following information for each cluster.

weight and height of center of cluster. list of people in cluster.

I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc. I dont want to give my solution now, but will post it later in the day?

Upvotes: 0

Answers (5)

Fabio Vinicius Binder

Reputation: 13214

You can create the following classes:

Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.

Upvotes: 1

Chap

Reputation: 2796

This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I would have a plan to have a visitor for the representation of the clusters, could allow different types of output (xml, strings, csv..etc). Maybe the visitor is overkill, if it was then I'd just have something like a ToString method that could be overridden.

Note: the Cluster creates a CenterClusterItem on the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem, it just holds the same amount of AClusterValues as a PersonClusterItem would (since the average isn't really a person).

Also, I forgot to make a method on the KCluster to begin the process, but that's implied.

Class Diagram http://img11.imageshack.us/img11/499/kcluster.png

Upvotes: 2

j_random_hacker

Reputation: 51226

That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.

One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.

Upvotes: 0

Roland Ewald

Reputation: 4614

Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:

instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found

Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).

On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.

Upvotes: 1

Robin Day

Reputation: 102478

I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.

A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.

Upvotes: 0

Design Problem

Answers (5)

Related Questions