Reputation: 11765
I have a matrix in SAS IML. For each pair of rows (say vectors A and B), I want to calculate the cosine similarity, A . B / ( ||A|| x ||B|| ).
So the result should be a square matrix with the same number of rows as the initial matrix.
If I pass a vector to the Euclid function, I get back a vector, so the function appears to be acting separately on each element of the vector. Indeed, the SAS documentation says:
If you call a Base SAS function with a matrix argument, the function will usually act elementwise on each element of teh [sic] matrix.
This is weird -- why would anyone want to calculate summary statistics for each element of a vector? They will always just return the elements. Is there a way to get the Euclidean norm for a vector?
My code is below. Apart from the Euclidean norm issue, is there a more efficient way to do this?
proc iml;
use fundstr;
read all var _all_ into wgts;
nrows=nrow(wgts);
d=j(nrows,nrows,0);
do i = 1 to nrows;
   do j = i to nrows;
      tmp = wgts[i,]*wgts[j,]`; /** still need to divide by the norm of each vector **/
      d[i,j] = tmp;
      d[j,i] = tmp;
   end;
end;
quit;
Upvotes: 1
Views: 1688
Reputation: 1210
Use matrix operations and think of this problem as (A/||A||) * (B/||B||).
The first step is to divide each row by its Euclidean norm, which is just sqrt(ssq(wgts[i,])). You can use the "sum of squares" subscript reduction operator (##) to compute this for all rows at once without writing a loop: sqrt(wgts[ ,##]); (See http://blogs.sas.com/content/iml/2012/05/23/compute-statistics-for-each-row-by-using-subscript-operators/ for an explanation and examples of subscript reduction operators.)
The pairwise dot product of rows is equivalent to the matrix multiplication A*A`, where A is the scaled matrix. Putting this all together leads to the solution:
wgts = ranuni(j(5,5));
norm = sqrt(wgts[ ,##]); /* Euclidean norm */
A = wgts/norm;
d = A*A`;
print d;
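To see why the matrix product gives the cosine similarities, here is a short derivation. The notation is mine, not from the code above: w_i stands for the i-th row of wgts, a_i for the corresponding scaled row of A, and theta_ij for the angle between rows i and j. It assumes no row of wgts is all zeros, since otherwise the norm is 0 and the division is undefined.

```latex
% Notation (assumed for illustration): w_i = i-th row of wgts, a_i = i-th row of A
\[
  \|w_i\| = \sqrt{\sum_k w_{ik}^2},
  \qquad
  a_i = \frac{w_i}{\|w_i\|}
\]
% Entry (i,j) of A A^T is the dot product of scaled rows i and j
\[
  (A A^{\top})_{ij}
  = \sum_k a_{ik}\, a_{jk}
  = \frac{\sum_k w_{ik}\, w_{jk}}{\|w_i\|\,\|w_j\|}
  = \frac{w_i \cdot w_j}{\|w_i\|\,\|w_j\|}
  = \cos\theta_{ij}
\]
```

A quick sanity check on the printed d: it should be symmetric, and every diagonal entry should be 1 (up to rounding), because each row has cosine similarity 1 with itself.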
If you want to compare this to the (inefficient) solution that uses loops, here it is:
nrows=nrow(wgts);
d=j(nrows,nrows,0);
do i = 1 to nrows;
   normi = sqrt(wgts[i,##]);
   do j = i to nrows;
      normj = sqrt(wgts[j,##]);
      tmp = wgts[i,]*wgts[j,]` / (normi * normj);
      d[i,j] = tmp;
      d[j,i] = tmp;
   end;
end;
print d;
By the way, you'll be happy to hear that in the next release of SAS/IML the typo in the doc is fixed :-)
Upvotes: 2
Reputation: 1511
For reference, I think this article by Rick is probably a good read for you. The method of converting vectors to a comma-delimited string is quite convenient.
Upvotes: 1