cgnorthcutt
cgnorthcutt

Reputation: 4046

How to create similarity matrix in Google BigQuery like pdist in MATLAB?

In MATLAB and Python (scipy), there is a function (pdist) which will return pairwise distances between every row of a given matrix.

So for a table in BigQuery:

A = user1 | 0 0 |
    user2 | 0 3 |
    user3 | 4 0 |

should return

              user1 user2 user3
dist = user1 |  0      3     4   | 
       user2 |  3      0     5   | 
       user3 |  4      5     0   | 

Or some variant (perhaps without diagonal and only the upper or lower half of the matrix since they are redundant.)

The pairs as columns is also acceptable (the approach (my guess so far) is to use a self join, but I am not sure how to iterate over all columns - for example I have ~3000 columns). This solution would look like:

dist = |user1  user2  3 | 
       |user1  user3  4 | 
       |user2  user3  5 |

Also as the distance metric between users, I don't necessarily just wan't Euclidean distance as the example here, but any general distance. One such distance is

sum(min(user1_d, user2_d) / diff(user1_d - user2_d)) for all d dimensions between the two users.

Has anyone found a Google BigQuery solution for this?

Upvotes: 1

Views: 1444

Answers (1)

Jordan Tigani
Jordan Tigani

Reputation: 26637

There are two answers:

  1. You can do this wth a cross join, and either build up a query programatically that will compute the distance function, or use the user-defined functions feature that is currently in closed alpha release.
  2. Please be judicious using cross join with large datasets. Cross Joins do N^2 work, and can generate N^2 output. A cross join of two smallish tables with only a million rows each generates a trillion rows of output. Cross joining two billion row tables generates a sestillion rows of output.

    If you do do cross joins of large datasets, you should try to pre-aggregate or filter to reduce the number of rows that actually need to be joined.

Upvotes: 3

Related Questions