Reputation: 4046
In MATLAB and Python (scipy), there is a function (pdist) which will return pairwise distances between every row of a given matrix.
So for a table in BigQuery:
A = user1 | 0 0 |
user2 | 0 3 |
user3 | 4 0 |
should return
user1 user2 user3
dist = user1 | 0 3 4 |
user2 | 3 0 5 |
user3 | 4 5 0 |
Or some variant (perhaps without diagonal and only the upper or lower half of the matrix since they are redundant.)
The pairs as columns is also acceptable (the approach (my guess so far) is to use a self join, but I am not sure how to iterate over all columns - for example I have ~3000 columns). This solution would look like:
dist = |user1 user2 3 |
|user1 user3 4 |
|user2 user3 5 |
Also as the distance metric between users, I don't necessarily just wan't Euclidean distance as the example here, but any general distance. One such distance is
sum(min(user1_d, user2_d) / diff(user1_d - user2_d)) for all d dimensions between the two users.
Has anyone found a Google BigQuery solution for this?
Upvotes: 1
Views: 1444
Reputation: 26637
There are two answers:
Please be judicious using cross join with large datasets. Cross Joins do N^2 work, and can generate N^2 output. A cross join of two smallish tables with only a million rows each generates a trillion rows of output. Cross joining two billion row tables generates a sestillion rows of output.
If you do do cross joins of large datasets, you should try to pre-aggregate or filter to reduce the number of rows that actually need to be joined.
Upvotes: 3