linhares
linhares

Reputation: 524

High-dimensional clustering from aggregates of observations

I have fallen into this weird high-dimensional clustering problem. Here is an analogy to explain it.

Imagine that 2^10 people enter into a forest, and we want to know how many bird species live there.

These birds differ from each other in, say, 128 dimensions, and all dimensions are binary. That is: either a bird has large beak or small beak, either it has a blue wing or it doesn't, etc. (Each bird species can be represented by 128 bits)

My problem is that when the guys get off the forest, we only have the aggregates of their observations:

"I saw 8 birds, 3 had blue beaks (5 didn't), 4 had blue wings (4 didn't), 1 had a large beak (7 didn't), etc". They do not report on the individual characteristics of their observations, but only on the aggregates of their observations.

There are two additional constraints:

i) all species are observed at least once; ii) The number of species is small (~2^5).

Of course, we can compile the aggregate of their aggregates (of 3000 observations, 357 birds had large beaks, etc..). But what about the clusters?

So the questions are:

  1. How can we find out how many species live there?

  2. How can we find out the characteristics of each species?

Upvotes: 3

Views: 133

Answers (2)

user1149913
user1149913

Reputation: 4523

If x an aggregate observation of a set of birds by a person, then you can approximate it by the matrix product Dz where D is a matrix whose columns represent characteristics of individual birds, and z is a vector of the counts of each bird.

If you assume that only a small number of birds are observed, then this acts as a constraint on the magnitude of z.

This problem is very similar to the sparse dictionary learning problem.

Here are a couple of links that both describe sparse dictionary learning (and related problems) and provide software to solve it: http://spams-devel.gforge.inria.fr/ and http://www.ux.uis.no/~karlsk/dle/index.html

Upvotes: 2

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77474

Since 2^128 = 340282366920938463463374607431768211456, you would need a pretty high sample size to draw valid conclusions. Every bird observed could easily be unique.

Upvotes: 2

Related Questions