Reputation: 423
Now I have to separate out any pair that is common to both input files, and compute the mean for that pair like this: (correlation in first text file) × (correlation in second text file) / ((correlation in first text file) + (correlation in second text file)). Again, these are stored in a separate matrix.
Building a tree: Now, out of all the elements in both input files, select the 10 most frequent ones. Each of these forms the root of a separate K tree. The algorithm goes like this: for the word at the root level, check all its harmonic mean values with the other tags in the matrix developed in the previous step. Select the two highest harmonic means, and put the other word of each tag pair as a child node of the root.
Can someone please guide me through the MATLAB steps of going through this? Thank you for your time.
Upvotes: 0
Views: 391
Reputation: 4398
Okay, so start by putting the data in a useful format: count the number of distinct words, M, and the number of images, N, and make an N-by-M matrix of binary values (I'll call this data1). Each of the N rows describes the words associated with a single image. Each of the M columns describes the images on which a single word is tagged. Therefore, the value at (n, m) is 0 if tag m is not on image n, and 1 if it is.
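As a rough sketch of that step, assuming the tags for each image have already been read into a cell array with one cell per image (the variable image_tags and its contents are made up for illustration; the actual file format is not shown in the question), the binary matrix could be built like this:

```matlab
% Hypothetical input: one cell per image, each holding that image's tags.
image_tags = {{'sky', 'sea'}, {'sea', 'boat'}, {'sky', 'sea', 'boat'}, {'tree'}};

% Vocabulary of distinct words as a row (unique returns it sorted).
vocab = reshape(unique([image_tags{:}]), 1, []);
N = numel(image_tags);   % number of images
M = numel(vocab);        % number of distinct words

% data1(n, m) is 1 if word m is tagged on image n, and 0 otherwise.
data1 = zeros(N, M);
for n = 1:N
    data1(n, :) = ismember(vocab, image_tags{n});
end
```

When comparing two files, it probably makes sense to build vocab from the tags of both files together, so that column m refers to the same word in data1 and data2.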
From this matrix, to find correlation between all pairs of words, you could do:
correlations1 = zeros(M, M);
for i = 1:M
    for j = 1:M
        correlations1(i, j) = corr(data1(:, i), data1(:, j));
    end
end
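Now the matrix correlations1 tells you the correlation between each pair of tags. Do the same for the other text file to get correlations2, keeping the same M tags in the same column order so that the two matrices line up. You can then make a matrix of harmonic means with: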
h_means = correlations1.*correlations2./(correlations1+correlations2);
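For reference, here is a small end-to-end sketch of the correlation and harmonic-mean steps on made-up data. It uses corrcoef from base MATLAB, which returns the full matrix of column-wise correlations in one call (corr needs the Statistics and Machine Learning Toolbox). The conventional harmonic mean is 2ab/(a+b); the expression used here omits the factor of 2, which scales every entry equally and so does not change any ranking. Replacing undefined entries with -Inf is my own assumption about how to handle them, not something the question specifies.

```matlab
% Toy data, made up for illustration: rows are images, columns are tags.
data1 = [1 0 1; 1 1 0; 0 1 1; 1 1 0; 0 0 1];
data2 = [1 1 0; 1 0 1; 0 1 1; 1 1 1; 0 1 0];

% corrcoef treats each column as a variable and returns an M-by-M matrix.
correlations1 = corrcoef(data1);
correlations2 = corrcoef(data2);

% Same formula as in the question; a zero denominator gives NaN or Inf.
h_means = correlations1 .* correlations2 ./ (correlations1 + correlations2);

% Assumption: treat undefined or infinite entries as "no relation" when ranking.
h_means(~isfinite(h_means)) = -Inf;
```

Note that a tag that appears on every image or on no image has a constant column, so its correlations come out as NaN; such tags may need to be dropped beforehand.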
You can find the 30 most frequent tags by counting the number of 1s in each column of the data matrix. Since we want the most common tags across both files, we'll add the column counts from the two data matrices first:
tag_counts = sum(data1, 1) + sum(data2, 1); % per-tag counts; the files may have different numbers of images
[~, tag_ranks] = sort(tag_counts, 'descend'); % get the indices in sorted order
top_tags = tag_ranks(1:30);
For the tree building at the end, you will either want to create a tree class (see classdef), or store the tree in an array. To find the top two highest harmonic means, you will want to look in the h_means matrix; for a tag m1, we can do:
row = h_means(m1, :);
row(m1) = -Inf; % exclude the tag's pairing with itself
[~, tag_ranks] = sort(row, 'descend');
top_tag = tag_ranks(1);
second_tag = tag_ranks(2);
You will then need to insert these tags into the tree and repeat.
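If you go the array route, one possible sketch of the "insert and repeat" loop is below. It stores the tree as parallel arrays and expands it breadth-first. The score matrix is toy data, and the depth limit and the rule that a tag may appear only once per tree are my assumptions, since the question does not say when to stop or whether tags can repeat.

```matlab
% Toy symmetric score matrix, made up for illustration (5 tags).
h_means = [0.50 0.40 0.10 0.30 0.20;
           0.40 0.50 0.35 0.05 0.15;
           0.10 0.35 0.50 0.25 0.45;
           0.30 0.05 0.25 0.50 0.12;
           0.20 0.15 0.45 0.12 0.50];
root = 1;        % index of one of the most frequent tags
max_depth = 2;   % assumption: stop expanding below this depth

% Tree as parallel arrays: nodes(k) is a tag index, parent(k) is the
% position in nodes of its parent (0 for the root), depth(k) is its level.
nodes = root;  parent = 0;  depth = 0;
k = 1;                                   % next node to expand
while k <= numel(nodes)
    if depth(k) < max_depth
        row = h_means(nodes(k), :);
        row(nodes) = -Inf;               % assumption: skip tags already in the tree
        [vals, order] = sort(row, 'descend');
        picks = order(1:min(2, sum(isfinite(vals))));   % up to two children
        nodes  = [nodes, picks];
        parent = [parent, repmat(k, 1, numel(picks))];
        depth  = [depth, repmat(depth(k) + 1, 1, numel(picks))];
    end
    k = k + 1;
end
```

Running the same loop once per entry of top_tags would give one tree per root. A classdef-based tree would work just as well; the arrays are only the shortest thing to write down.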
Upvotes: 1