s.e

Reputation: 271

Matlab: Calculating the pairwise distance for a multidimensional cell array

I have a multidimensional cell array, attributes (763x6 cell), which is the input to my code.

I have no syntax errors, but the distance matrix D that results from my code has the same values in each row. I don't know how to change my distance function so that it can handle multiple rows / instances.

[Figure: the resulting distance matrix D]

Sample of my data 5x6 cell:

'low back pain risk factor staff'   'low back pain' 'low back pain risk factor staff'   'back pain pain risk factor epidemiology' 'spiritual comment comment care be' 'spiritual comment comment care be'
'psd psd antipsychotic essential receptor'  'ht ht 5'   'antipsychotic protein signal receptor drug'    'cell protein signal cell receptor' 'spiritual comment comment care be' 'spiritual comment comment care be'
'school of medicine'    'case western reserve university'   'antidepressant action 5 for in'    'ht ht 5' 'spiritual comment comment care be' 'spiritual comment comment care be'
'spiritual comment comment care be' 'heal holistic comment india india' 'heal religious mental disorder psychiatric symptom'    'heal religious mental disorder psychiatric psychiatric' 'spiritual comment comment care be' 'spiritual comment comment care be'

Upvotes: 1

Views: 210

Answers (2)

Will

Reputation: 1880

The problem is with your distance function, which needs to be able to return multiple distances when given multiple rows in the second argument, as detailed in the table in the pdist2 documentation.

It also seems to be handling the cell arrays generated by regexp incorrectly. Because cellfun is used to pass cell arrays of words to intersect, the intersect function ends up comparing the letters of different words rather than the words themselves.
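To make the letters-versus-words distinction concrete, here is a small sketch using two made-up phrases (they are not taken from the question's data). Applied to two char vectors, intersect compares individual characters; applied to two cell arrays of char vectors, it compares whole words:

```matlab
% Two illustrative phrases (hypothetical, chosen for this example only)
a = 'low back pain';
b = 'back pain risk';

% char vs. char: intersect works character by character
commonChars = intersect(a, b);

% Split into words first: regexp on a char vector returns a 1-by-N cell of words
wordsA = regexp(a, '\s+', 'split');
wordsB = regexp(b, '\s+', 'split');

% cell vs. cell: intersect now compares whole words
commonWords = intersect(wordsA, wordsB);

disp(commonChars);
disp(commonWords);
```

The first result is a sorted string of shared characters (including the space), which says very little about the phrases. The second is the list of shared words, which is presumably what the distance function is meant to count.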

I believe the following function returns values with the desired effect:

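function D2 = intersection(XI,XJ)
% XI is one observation (1-by-n cell), XJ is m observations (m-by-n cell).
% Returns an m-by-1 vector: one value per row of XJ.

% Split every phrase into a cell array of words
wordsI = regexp(XI, '\s+', 'split');
wordsJ = regexp(XJ, '\s+', 'split');

D2 = zeros(size(XJ,1),1);
for i=1:numel(D2)
    % Count the words shared column by column, then sum over the columns
    D2(i) = sum(cellfun(@(wI,wJ) numel(intersect(wI,wJ)), wordsI, wordsJ(i,:)));
end
end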
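One way to sanity-check this logic without going through pdist2 is to run the same statements on a tiny hand-made input. The 3x2 cell below is invented for illustration, and the variable names simply mirror those in the function above:

```matlab
% Hypothetical 3x2 cell of phrases (not the question's data)
X = {'low back pain', 'risk factor staff'; ...
     'back pain',     'risk factor'; ...
     'ht receptor',   'cell protein'};

XI = X(1,:);   % one observation, as passed in the first argument
XJ = X;        % several observations, as passed in the second argument

% regexp on a cell array returns a cell of the same size,
% each element being a cell array of words
wordsI = regexp(XI, '\s+', 'split');
wordsJ = regexp(XJ, '\s+', 'split');

D2 = zeros(size(XJ,1), 1);
for i = 1:numel(D2)
    % Per column: count shared words, then sum across columns
    D2(i) = sum(cellfun(@(wI,wJ) numel(intersect(wI,wJ)), wordsI, wordsJ(i,:)));
end

disp(D2);
```

The result has one entry per row of XJ, which is the shape the pdist2 documentation asks for. Note that the value counts shared words, so it grows as two rows become more alike; whether that is the intended notion of "distance" depends on what the question is after.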

Upvotes: 0

kabdulla

Reputation: 5419

This is not a solution, but it is too long to fit in a comment. The problem lies in how pdist2 calls your function when calculating the pairwise distances.

To quickly check this we can pass it a distance function which just prints out the XI and XJ arguments passed to it (when it is called from pdist2):

X = {'foo1', 'foo2', 'foo3', 'foo4', 'foo5', 'foo6';...
     'bar1', 'bar2', 'bar3', 'bar4', 'bar5', 'bar6'};

% call distance function via pdist2
D = pdist2(X,X,@printArgsIn);

And in a function file:

function D2 = printArgsIn(XI,XJ)
    disp('XI'); disp(XI);
    disp('XJ'); disp(XJ);

    D2 = 1;
end

This returns the following:

XI
    'foo1'    'foo2'    'foo3'    'foo4'    'foo5'    'foo6'

XJ
    'foo1'    'foo2'    'foo3'    'foo4'    'foo5'    'foo6'

XI
    'foo1'    'foo2'    'foo3'    'foo4'    'foo5'    'foo6'

XJ
    'foo1'    'foo2'    'foo3'    'foo4'    'foo5'    'foo6'
    'bar1'    'bar2'    'bar3'    'bar4'    'bar5'    'bar6'

XI
    'bar1'    'bar2'    'bar3'    'bar4'    'bar5'    'bar6'

XJ
    'foo1'    'foo2'    'foo3'    'foo4'    'foo5'    'foo6'
    'bar1'    'bar2'    'bar3'    'bar4'    'bar5'    'bar6'

Ignoring the first XI, XJ pair (if you look at pdist2 in detail, you'll see that the distance function is called once to test that it works), you can see that pdist2 calls the distance function with a single observation as XI and all observations as XJ.

In other words, it expects your distance function to handle multiple rows/instances in XJ and to return a column vector of distances, one per row of XJ. I haven't looked at your distance function in detail, but I don't think it allows for this.
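As a minimal sketch of that contract, here is a numeric example that does not depend on the question's data. The handle name cityBlockFun and the matrix X are made up for illustration, and city-block distance is used only because it is easy to check by hand. The loop mimics the "one row against all rows" calls seen in the output above rather than calling pdist2 itself:

```matlab
% Custom distance with the shape described above:
%   XI is 1-by-n (one observation), XJ is m-by-n (m observations),
%   and the result is m-by-1, one distance per row of XJ.
cityBlockFun = @(XI, XJ) sum(abs(bsxfun(@minus, XJ, XI)), 2);

X = [0 0; 1 2; 3 1];   % made-up numeric data, 3 observations

% Mimic the calling pattern: one row against all rows
D = zeros(size(X,1));
for i = 1:size(X,1)
    D(i,:) = cityBlockFun(X(i,:), X).';
end

disp(D);
```

With the Statistics and Machine Learning Toolbox installed, passing the same handle as pdist2(X, X, cityBlockFun) would be expected to produce the same matrix. A function that returns a single scalar regardless of how many rows XJ has, like printArgsIn above, does not satisfy this contract.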

Upvotes: 1
