Reputation: 6030
We have a group of patients in one table and we want to match each of them to a patient exactly like them in another table - but we want pairs of patients so we cannot match a patient to more than one other patient.
Left Outer Joins add every occurrence of a match - which matches patients to every other possible match - so we need some other approach.
We see lots of answers on SO about matching to the first row - but that leaves us with a single patient being matched to multiple other patients - not a pair like we need.
Is there any possible way to create pair matches without duplication between tables in Google Big Query? (Even if it takes multiple steps.)
ADDENDUM: Here are example tables. It would be great to see a SQL example using this.
Here is what is needed.
Example Source Tables:
Table A
PatientID Race Gender
1 A F
2 B M
3 A F
Table B
PatientID
4 A F
5 A F
6 B M
Results Table Desired:
Table C
A.PatientID B.PatientID_Match
1 4
2 6
3 5
CLARIFICATION: Patients in Table A must match patients from Table B. (They cannot match patients in their own table.)
Upvotes: 3
Views: 1129
Reputation: 173121
SELECT
a.PatientID AS PatientID,
b.PatientID AS PatientID_Match
FROM (
SELECT PatientID, Race, Gender,
ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
FROM TableA
) AS a
JOIN (
SELECT PatientID, Race, Gender,
ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos
Above will leave out those patients from TableA which either do not have match in TableB or potential match in TableB was already used as match for another patient in TableA (as per your we want pairs of patients so we cannot match a patient to more than one other patient.
requirement)
To address Dudu's comments about NULL for attributes:
SELECT
a.PatientID AS PatientID,
b.PatientID AS PatientID_Match
FROM (
SELECT
PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender,
ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
FROM TableA
) AS a
JOIN (
SELECT
PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender,
ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos
Upvotes: 0
Reputation: 44971
select min (case tab when 'A' then patientID end) as A_patientID
,min (case tab when 'B' then patientID end) as B_patientID
from (select tab
,patientID
,rank() over (order by race,gender) r
,row_number() over (partition by tab,race,gender order by patientID) rn
from ( select 'A' as tab,A.* from A
union all select 'B' as tab,B.* from B
) t
) t
group by t.r
,t.rn
-- having count(*) = 2
;
+-------------+-------------+
| a_patientid | b_patientid |
+-------------+-------------+
| 3 | 5 |
+-------------+-------------+
| 2 | 6 |
+-------------+-------------+
| 1 | 4 |
+-------------+-------------+
The main idea -
Rows from both tables are divided to groups by their attributes (race,gender).
This is being done using the RANK function.
Within each group of attributes (race,gender) the rows are being ordered, per table, by their patientid .
+-----+-----------+------+--------+ +---+----+
| tab | patientid | race | gender | | r | rn |
+-----+-----------+------+--------+ +---+----+
+-----+-----------+------+--------+ +---+----+
| A | 1 | A | F | | 1 | 1 |
+-----+-----------+------+--------+ +---+----+
| B | 4 | A | F | | 1 | 1 |
+-----+-----------+------+--------+ +---+----+
+-----+-----------+------+--------+ +---+----+
| A | 3 | A | F | | 1 | 2 |
+-----+-----------+------+--------+ +---+----+
| B | 5 | A | F | | 1 | 2 |
+-----+-----------+------+--------+ +---+----+
+-----+-----------+------+--------+ +---+----+
| A | 2 | B | M | | 5 | 1 |
+-----+-----------+------+--------+ +---+----+
| B | 6 | B | M | | 5 | 1 |
+-----+-----------+------+--------+ +---+----+
In the final phase, the rows are being divided into groups (GROUP BY) by their RANK (r) and ROW_NUMBER (rn) values, which means each group has a row from each table (or only a single row if there is no matching row from the other table).
Upvotes: 2
Reputation: 1270713
In many databases, a lateral join would be the way to go. In Google, you can use row_number()
. The query looks something like this:
select p.*, pp.patient_id as other_patient_id
from patients p cross join
(select p.*,
row_number() over (partition by col1, col2, col3 order by col1) as seqnum
from patients p
) pp
where pp.seqnum = 1;
The columns in the partition by
are the columns used for similarity.
Upvotes: 1