itzy

Reputation: 11765

Calculate cosine similarity in SAS/IML

I have a matrix in SAS IML. For each pair of rows (say vectors A and B), I want to calculate the cosine similarity,

A . B / ( ||A|| x ||B|| ).

So the result should be a square matrix with the same number of rows as the initial matrix.

If I pass a vector to the Euclid function, I get back a vector, so the function appears to be acting separately on each element of the vector. Indeed, the SAS documentation says:

If you call a Base SAS function with a matrix argument, the function will usually act elementwise on each element of teh [sic] matrix.

This is weird -- why would anyone want to calculate summary statistics for each element of a vector? They will always just return the elements. Is there a way to get the Euclidean norm for a vector?

My code is below. Notwithstanding the Euclidean norm, is there a more efficient way to do this?

proc iml;
 use fundstr;
 read all var _all_ into wgts;

 nrows=nrow(wgts);
 d=j(nrows,nrows,0);

 do i = 1 to nrows;
  do j = i to nrows;

   tmp = wgts[i,]*wgts[j,]`; /** still need to divide by the norm of each vector **/
   d[i,j] = tmp;
   d[j,i] = tmp;

  end;
 end;
quit;

Upvotes: 1

Views: 1688

Answers (2)

Rick

Reputation: 1210

Use matrix operations and think of this problem as (A/||A||) * (B/||B||).

The first step is to divide each row by its Euclidean norm, which is just sqrt(ssq(wgts[i,])). You can use the "sum of squares" subscript reduction operator (##) to compute this for all rows at once without writing a loop: sqrt(wgts[ ,##]); (See http://blogs.sas.com/content/iml/2012/05/23/compute-statistics-for-each-row-by-using-subscript-operators/ for an explanation and examples of subscript reduction operators.)
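As a minimal sketch of what the ## subscript reduction operator returns, here is a small example on a hand-picked matrix. The matrix x and the names rowSS and rowNorm are made up for illustration; they are not part of the original question.

```sas
proc iml;
/* hypothetical 3 x 2 test matrix; each row is a simple Pythagorean pair */
x = {3  4,
     6  8,
     5 12};

rowSS   = x[ ,##];      /* sum of squares of each row: {25, 100, 169} */
rowNorm = sqrt(rowSS);  /* Euclidean norm of each row: {5, 10, 13}    */

print rowSS rowNorm;
quit;
```

Leaving the row subscript empty and putting ## in the column position reduces across the columns, so the result is a column vector with one value per row.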

The pairwise dot product of rows is equivalent to the matrix multiplication A*A`, where A is the scaled matrix. Putting this all together leads to the solution:

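wgts = ranuni(j(5,5));   /* 5 x 5 matrix of random test data */
norm = sqrt(wgts[ ,##]); /* Euclidean norm of each row (column vector) */
A = wgts/norm;           /* divide each row by its norm */
d = A*A`;                /* pairwise dot products of the unit-length rows */
print d;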

If you want to compare this to the (inefficient) solution that uses loops, here it is:

nrows=nrow(wgts);
d=j(nrows,nrows,0);
do i = 1 to nrows;
   normi = sqrt(wgts[i,##]);
   do j = i to nrows;
      normj = sqrt(wgts[j,##]);
      tmp = wgts[i,]*wgts[j,]` / (normi * normj);
      d[i,j] = tmp;
      d[j,i] = tmp;
   end;
end;
print d;
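If you want to convince yourself that the two approaches agree, one possible check is to keep both results under separate names and look at the largest absolute difference. This is only a sketch: the small test matrix and the names dVec, dLoop, and maxDiff are assumptions for illustration, and the inner loop index is called k rather than j.

```sas
proc iml;
/* hypothetical test data: three row vectors in the plane */
wgts = {1 0,
        1 1,
        0 2};

/* matrix version */
norm = sqrt(wgts[ ,##]);   /* Euclidean norm of each row */
A    = wgts / norm;        /* rows scaled to unit length */
dVec = A * A`;

/* loop version, reusing the precomputed norms */
n     = nrow(wgts);
dLoop = j(n, n, 0);
do i = 1 to n;
   do k = i to n;
      dLoop[i,k] = wgts[i,]*wgts[k,]` / (norm[i]*norm[k]);
      dLoop[k,i] = dLoop[i,k];
   end;
end;

/* should be zero up to floating-point rounding */
maxDiff = max(abs(dVec - dLoop));
print dVec, dLoop, maxDiff;
quit;
```

Every diagonal entry should be 1, because each row has cosine similarity 1 with itself. Note that a row of all zeros has norm 0, so its cosine similarity is undefined in either version.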

By the way, you'll be happy to hear that in the next release of SAS/IML the typo in the doc is fixed :-)

Upvotes: 2

Robbie Liu

Reputation: 1511

As a reference, I think this article by Rick is probably a good read for you. The method of converting vectors to a comma-delimited string is quite convenient.

Upvotes: 1
