cooky
cooky

Reputation:

how to sort numeric vectors?

Now, there are lots of ordered numeric vectors(about 50 dimension). Each dimension is a integer between 0~200. I want to sort them to ensure the similar vectors in the same bucket, all vectors in adjacent buckets have also a similarity to a certain extent. e.g. <1,24,25,78,9> and <2,24,24,78,9> should be in same bucket(bucket number is 010), but <3,29,26,74,11> and <4,28,29,75,10> (they are also in same bucket)is in adjacent buckets (bucket number is 011)

how to design such a sorting function?

Upvotes: 0

Views: 414

Answers (6)

Alex319
Alex319

Reputation: 3918

Perhaps this example is not good. The similarity I wish is that the cosine of angles between two vectors is small. In fact, i want to transform a text to a vector where each dimension corresponds to a concept in the text. I suppose all text have same number of concepts so that their vector dimension are same. – cooky

I assume you mean the cosine is close to 1, so they're both going in close to the same direction. A cosine close to zero means the angle is about orthogonal.

And by the way, if you're doing that, then that means that, say (1,2,3,4,5) and (10,20,30,40,49) should be very close together (despite the actual numbers being far apart). So you want to normalize your vectors prior to using any of the spatial indexing methods.

Upvotes: 0

Ray
Ray

Reputation: 3109

Sounds like point clustering. And it's a very active field. Try googling "vector qantization", k-means clustering, dendrograms, ...

Upvotes: 0

Michael Borgwardt
Michael Borgwardt

Reputation: 346427

What you want is pretty much the exact opposite of a hash: for hashing, you want to avoid having similar values end up in the same bucket.

It seems like you're looking for a spatial index; R-Trees sounds the most promising.

Upvotes: 0

SPWorley
SPWorley

Reputation: 11878

You want a Morton code. It interleaves the bits of each dimension to help keep similar values together. It's often done for 2 and 3D, but it works for any dimension.

Express each of the D values as binary word of B bits, then interleave the bits to form a new D*B bit long number. That's your lookup table number. If you want a smaller number, discard lower bits to get fewer bins.

An even better (but much more annoying to compute) function is a multidimensional Hilbert curve mapping. This is very difficult to work with in practice, but it does have pretty much the best indexing locality you can get.

Upvotes: 2

SingleNegationElimination
SingleNegationElimination

Reputation: 156258

It sounds like you're trying to sort your vectors by locality.

Perhaps what you really want is a generalization of a QuadTree. If each vector of length R is considered a coordinate in R-Space (R==2 -> 2d plane space, r==3 -> 3d space, etc), then you can divide that R-cube in half along each dimension, to get 2R inner R-hyper cubes. continue this process any time a cube contains more than 1 un-equal coordinates. You can then traverse this tree of R-hypercubes to efficiently locate neighboring vectors.

Upvotes: 1

Tim Sylvester
Tim Sylvester

Reputation: 23168

What you describe is not much like a "good" hashing function in the normal sense.

Can you be more specific about how you get from those particular sequences to those bucket numbers (010, 011)? Your hash function is basically going to be that same process.

Upvotes: 0

Related Questions