Ahdee

Reputation: 4949

Automatically learning clusters

Hi, complete newbie question here: I have a table with two columns. The first column gives the "bin", a code for where the fruit flies live. The second column is either 0 or 1: neutral vs. really likes sugar, respectively. I have two questions:

1) I suspect there is a single variable, something about where they live, that determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: bins whose flies like sugar vs. bins whose flies are neutral? That way we can run further experiments to determine what it is about those bins.

2) Is there a way to automatically determine how many clusters there might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine the sugar preference.

Apologies if this is trivial. The table is listed below. Thanks!

Bin sugar
1   1
1   1
1   0
1   0
2   1
2   0
2   0
3   1
3   0
3   1
3   1
4   1
4   1
4   1
5   1
5   0
5   1
6   0
6   0
6   0
7   0
7   1
7   1
8   1
8   0
8   1
9   1
9   0
9   0
9   0
10  0
10  0
10  0
11  1
11  1
11  1
12  0
12  0
12  0
12  0
13  0
13  0
13  1
13  0
13  0
14  0
14  0
14  0
14  0
15  1
15  0
15  0
16  1
16  1
17  1
17  1
18  0
18  1
18  1
17  1
19  1
20  1
20  0
20  0
20  1
21  0
21  0
21  1
21  0
22  1
22  0
22  1
22  1
23  1
23  1
24  1
24  0
25  0
25  1
25  0
26  1
26  1
27  1
27  1

Upvotes: 0

Views: 90

Answers (1)

mp85

Reputation: 422

Okay, assuming I understood what you meant, one approach to problem 1) is to use Bayes' theorem. Say event L is "a fly likes sugar" and event B is "a fly is in bin B".

So what you have is:

number of flies = 84
size of each bin = (e.g. size of bin 1: 4)

probability that a fly likes sugar:

P(L) = flies that like sugar / total number of flies = 43/84

probability that a fly doesn't like sugar:

P(notL) = 1 - P(L) = 41/84

probability that a fly is in a given bin:

P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)

probability that a fly isn't in a given bin:

P(notB) = 1 - P(B) = 80/84 (for bin 1)

probability that a fly likes sugar, knowing that it's in bin B:

P(L|B) = flies that like sugar in the bin / size of the bin
(e.g. for bin 1 it is 2/4 = 1/2)

probability that a fly likes sugar, knowing that it's not in bin B:

P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = (43 - 2)/(84 - 4) = 41/80 (for bin 1)

You want to know the probability that a fly is in a given bin B, knowing that it likes sugar, which you can obtain with:

P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
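As a quick check of the formula, here is a minimal sketch that plugs in the bin 1 numbers quoted above. The variable names are just illustrative, and exact fractions are used to avoid rounding. It also shows that the denominator is simply P(L) again.

```python
from fractions import Fraction

# Numbers for bin 1, taken from the steps above
p_b = Fraction(4, 84)                # P(B): bin 1 holds 4 of the 84 flies
p_not_b = 1 - p_b                    # P(notB) = 80/84
p_l_given_b = Fraction(2, 4)         # P(L|B): 2 of the 4 flies in bin 1 like sugar
p_l_given_not_b = Fraction(41, 80)   # P(L|notB)

# Denominator of Bayes' formula (law of total probability)
evidence = p_l_given_b * p_b + p_l_given_not_b * p_not_b
p_b_given_l = p_l_given_b * p_b / evidence

print(evidence)       # 43/84, i.e. P(L)
print(p_b_given_l)    # 2/43
```

So for bin 1, P(B|L) = 2/43: two of the 43 sugar-loving flies live in bin 1.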

If you compute P(B|L) and P(B|notL) for each bin, you will see which bins are most likely to contain flies that like sugar. You can then study those bins further.
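To do this for every bin at once, something along these lines might work. It is only a sketch: the `counts` dictionary is my own tally of the table in the question (bin -> flies that like sugar, bin size), and `posterior` is a hypothetical helper name, not part of any library.

```python
from fractions import Fraction

# Tallied by hand from the question's table: bin -> (likes sugar, bin size)
counts = {
    1: (2, 4), 2: (1, 3), 3: (3, 4), 4: (3, 3), 5: (2, 3), 6: (0, 3),
    7: (2, 3), 8: (2, 3), 9: (1, 4), 10: (0, 3), 11: (3, 3), 12: (0, 4),
    13: (1, 5), 14: (0, 4), 15: (1, 3), 16: (2, 2), 17: (3, 3), 18: (2, 3),
    19: (1, 1), 20: (2, 4), 21: (1, 4), 22: (3, 4), 23: (2, 2), 24: (1, 2),
    25: (1, 3), 26: (2, 2), 27: (2, 2),
}

total = sum(n for _, n in counts.values())   # total number of flies
likes = sum(k for k, _ in counts.values())   # flies that like sugar


def posterior(bin_id):
    """P(B|L) for one bin, using the Bayes formula given above."""
    k, n = counts[bin_id]
    p_b = Fraction(n, total)
    p_l_given_b = Fraction(k, n)
    p_l_given_not_b = Fraction(likes - k, total - n)
    numerator = p_l_given_b * p_b
    return numerator / (numerator + p_l_given_not_b * (1 - p_b))


# List bins from highest to lowest P(L|B), with P(B|L) alongside
for b in sorted(counts, key=lambda b: Fraction(*counts[b]), reverse=True):
    k, n = counts[b]
    print(f"bin {b:2d}: P(L|B) = {k}/{n}, P(B|L) = {posterior(b)}")
```

One thing to keep in mind when reading the output: P(B|L) works out to (flies that like sugar in the bin) / (all flies that like sugar), so bigger bins tend to score higher. P(L|B) is the per-bin rate, which is why the listing is sorted on that and shows both.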

Hope I was clear; my statistics is a bit rusty and I'm not sure I'm doing everything correctly. Take it as a hint to point you in the right direction.

You can refer here to get more accurate reasoning and results.

As for problem 2)... I have to think about it a bit more.

Upvotes: 1
