Praxiteles
Praxiteles

Reputation: 6030

Is there any way in Google Big Query to Left Outer Join one-to-one *without reusing* any row from the right?

We have a group of patients in one table and we want to match each of them to a patient exactly like them in another table - but we want pairs of patients so we cannot match a patient to more than one other patient.

Left Outer Joins add every occurrence of a match - which matches patients to every other possible match - so we need some other approach.

We see lots of answers on SO about matching to the first row - but that leaves us with a single patient being matched to multiple other patients - not a pair like we need.

Is there any possible way to create pair matches without duplication between tables in Google Big Query? (Even if it takes multiple steps.)


ADDENDUM: Here are example tables. It would be great to see a SQL example using this.

Here is what is needed.

Example Source Tables:

Table A
PatientID     Race     Gender    
   1            A        F
   2            B        M
   3            A        F

Table B
PatientID
   4            A        F
   5            A        F
   6            B        M


Results Table Desired:

Table C
A.PatientID     B.PatientID_Match
    1               4
    2               6
    3               5

CLARIFICATION: Patients in Table A must match patients from Table B. (They cannot match patients in their own table.)

Upvotes: 3

Views: 1129

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173121

SELECT 
  a.PatientID AS PatientID,
  b.PatientID AS PatientID_Match
FROM (
  SELECT PatientID, Race, Gender, 
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableA
) AS a
JOIN (
  SELECT PatientID, Race, Gender, 
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos

Above will leave out those patients from TableA which either do not have match in TableB or potential match in TableB was already used as match for another patient in TableA (as per your we want pairs of patients so we cannot match a patient to more than one other patient. requirement)

To address Dudu's comments about NULL for attributes:

SELECT 
  a.PatientID AS PatientID,
  b.PatientID AS PatientID_Match
FROM (
  SELECT 
    PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender, 
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableA
) AS a
JOIN (
  SELECT 
    PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender, 
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos

Upvotes: 0

David דודו Markovitz
David דודו Markovitz

Reputation: 44971

select      min (case tab when 'A' then patientID end)  as A_patientID  
           ,min (case tab when 'B' then patientID end)  as B_patientID

from       (select  tab
                   ,patientID
                   ,rank()       over (order by     race,gender)                        r
                   ,row_number() over (partition by tab,race,gender order by patientID) rn

            from    (           select 'A' as tab,A.* from A 
                    union all   select 'B' as tab,B.* from B
                    ) t
            ) t

group by    t.r
           ,t.rn

-- having       count(*) = 2
;

+-------------+-------------+
| a_patientid | b_patientid |
+-------------+-------------+
| 3           | 5           |
+-------------+-------------+
| 2           | 6           |
+-------------+-------------+
| 1           | 4           |
+-------------+-------------+

The main idea -

Rows from both tables are divided to groups by their attributes (race,gender).
This is being done using the RANK function.

Within each group of attributes (race,gender) the rows are being ordered, per table, by their patientid .

+-----+-----------+------+--------+    +---+----+
| tab | patientid | race | gender |    | r | rn |
+-----+-----------+------+--------+    +---+----+

+-----+-----------+------+--------+    +---+----+
| A   | 1         | A    | F      |    | 1 | 1  |
+-----+-----------+------+--------+    +---+----+
| B   | 4         | A    | F      |    | 1 | 1  |
+-----+-----------+------+--------+    +---+----+

+-----+-----------+------+--------+    +---+----+
| A   | 3         | A    | F      |    | 1 | 2  |
+-----+-----------+------+--------+    +---+----+
| B   | 5         | A    | F      |    | 1 | 2  |
+-----+-----------+------+--------+    +---+----+

+-----+-----------+------+--------+    +---+----+
| A   | 2         | B    | M      |    | 5 | 1  |
+-----+-----------+------+--------+    +---+----+
| B   | 6         | B    | M      |    | 5 | 1  |
+-----+-----------+------+--------+    +---+----+

In the final phase, the rows are being divided into groups (GROUP BY) by their RANK (r) and ROW_NUMBER (rn) values, which means each group has a row from each table (or only a single row if there is no matching row from the other table).

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1270713

In many databases, a lateral join would be the way to go. In Google, you can use row_number(). The query looks something like this:

select p.*, pp.patient_id as other_patient_id
from patients p cross join
     (select p.*,
             row_number() over (partition by col1, col2, col3 order by col1) as seqnum
      from patients p
     ) pp
where pp.seqnum = 1;

The columns in the partition by are the columns used for similarity.

Upvotes: 1

Related Questions