Reputation: 173
I'm trying to implement the K-means clustering algorithm on 600 data points, each with 60 dimensions. A line of input looks something like:
28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337 34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27 30.7326 29.5054 33.0292 25.04 28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747 31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
I'm thinking of having a struct for data points, where each point has a vector of attributes, like
struct Point{
std::vector<double> attributes;
};
and then, when iterating through all the points, adding up the attributes in a for loop with i as the index. Is this the best way to go about this?
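For illustration, the "add up the attributes" step could look like the minimal sketch below. It assumes the sum is taken over the points currently assigned to one cluster and then divided by the cluster size to get the new centroid; `mean_of` is a hypothetical helper name, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point {
    std::vector<double> attributes;
};

// Hypothetical helper: component-wise mean of the points in one cluster.
// Assumes `cluster` is non-empty and every point has the same dimension.
Point mean_of(const std::vector<Point>& cluster) {
    assert(!cluster.empty());
    const std::size_t dims = cluster.front().attributes.size();

    Point centroid;
    centroid.attributes.assign(dims, 0.0);

    // Add up attribute i of every point, for each i.
    for (const Point& p : cluster) {
        assert(p.attributes.size() == dims);
        for (std::size_t i = 0; i < dims; ++i) {
            centroid.attributes[i] += p.attributes[i];
        }
    }

    // Divide each sum by the number of points to get the mean.
    for (double& a : centroid.attributes) {
        a /= static_cast<double>(cluster.size());
    }
    return centroid;
}
```

Calling `mean_of` once per cluster after each assignment pass would give the updated centroids.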
Upvotes: 1
Views: 123
Reputation: 179779
600 data points is a small number. Computing distances in 60-dimensional space for 600 points is about 36,000 operations (600 × 60). That's manageable.
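At this size, a plain brute-force search is probably enough. The sketch below assumes squared Euclidean distance and plain `std::vector<double>` rows; `squared_distance` and `nearest_centroid` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance; the square root is not needed for comparisons.
double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Hypothetical helper: index of the centroid closest to `point`.
// Cost is (number of centroids) x (number of dimensions) per point.
std::size_t nearest_centroid(const std::vector<double>& point,
                             const std::vector<std::vector<double>>& centroids) {
    assert(!centroids.empty());
    std::size_t best = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        const double d = squared_distance(point, centroids[c]);
        if (d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return best;
}
```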
Keep in mind that your data is very, very sparse though. A more realistic data set for 60 dimensions would have far, far more points. In that case you might need to think about pre-partitioning the space, which would complicate your data structure.
One intermediate-level technique is to exploit the fact that the partial sums of a (squared) distance only grow as you add dimensions. When looking for the nearest neighbor of point P, you calculate the full distance to the first point in all 60 dimensions. This establishes an upper bound D on the nearest-neighbor distance. When you calculate the distance to the second point, you may find that you already exceed D after 59 dimensions or fewer, so you can stop early. The tricky bit is that you cannot check this for every point after adding every dimension; that would be excessive. You may need to manually unroll loops, and exactly how depends on your data distribution.
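A minimal sketch of that early-exit idea follows. It assumes squared Euclidean distance and checks the running sum only once per block of `stride` dimensions instead of after every dimension; the function name and the default stride of 8 are illustrative guesses, and the right block size would have to be measured on real data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared distance between a and b, abandoned early
// once the partial sum exceeds `bound`. In that case the returned value is
// only a partial sum, but it is guaranteed to be greater than `bound`.
double bounded_squared_distance(const std::vector<double>& a,
                                const std::vector<double>& b,
                                double bound,
                                std::size_t stride = 8) {
    assert(a.size() == b.size());
    assert(stride > 0);
    double sum = 0.0;
    std::size_t i = 0;
    while (i < a.size()) {
        // Accumulate one block of dimensions without any comparison.
        const std::size_t end = std::min(i + stride, a.size());
        for (; i < end; ++i) {
            const double d = a[i] - b[i];
            sum += d * d;
        }
        // Partial sums never decrease, so exceeding the bound is final.
        if (sum > bound) {
            return sum;
        }
    }
    return sum;
}
```

In a nearest-neighbor loop, `bound` would be the best squared distance found so far, and a candidate replaces it only when the returned value is smaller.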
Upvotes: 1
Reputation: 1
I'm not sure what you are asking, but with C++11 you could use std::array, so you might have something like
std::vector<std::array<double,60>> myvec;
Then myvec[2] and myvec[10] (assuming myvec.size() > 10) are both elements of type std::array<double,60>, so you can of course use myvec[2][7] and myvec[10][59]
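As a rough sketch of how such a vector might be filled, assuming each input line holds 60 whitespace-separated numbers as in the question: `kDims`, `Row`, `parse_row` and `read_rows` are hypothetical names, and silently skipping short or malformed lines is just one possible policy.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

constexpr std::size_t kDims = 60;  // dimension count taken from the question
using Row = std::array<double, kDims>;

// Hypothetical helper: parse one whitespace-separated line into `out`.
// Returns false if the line holds fewer than kDims readable numbers.
bool parse_row(const std::string& line, Row& out) {
    std::istringstream in(line);
    for (std::size_t i = 0; i < kDims; ++i) {
        if (!(in >> out[i])) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: read all well-formed rows from a stream.
// Lines that fail to parse are skipped.
std::vector<Row> read_rows(std::istream& in) {
    std::vector<Row> rows;
    std::string line;
    Row row{};
    while (std::getline(in, line)) {
        if (parse_row(line, row)) {
            rows.push_back(row);
        }
    }
    return rows;
}
```

Passing an opened `std::ifstream` to `read_rows` would give a `myvec` that can be indexed as `myvec[point][dimension]`.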
Upvotes: 5