How to consolidate two id columns, identifying which rows belong to same set of related IDs

Question

I have 2 ID columns that are created/collected independently. I'm trying to consolidate these two ID columns into one by determining which rows are part of the same related group of ids based on either of the two ID columns. I would consider the rows to be related based on a few rules:

1: If a LOAN has the same value in multiple rows, they belong to the same group (in the example for reference only.) I've called it loan_group. No issues here.

2: If a COLLATERAL has the same value in multiple rows, they belong to the temporary group. I've called it collateral_group (same rule as #1.) No issues here.

3: Finally, and I'm not sure how to phrase this exactly, but any time there is overlap between values that are part of the same group (across loan and collateral columns), those groups should be further consolidated. For example:

LOAN  COLLATERAL  loan_group  collateral_group  final_grouping
----  ----------- ----------  ----------------  --------------
L1    C1*         1           1                 **1**
L2**  C1*         2           1                 **1**
L5    C8          3           2                 2
L2**  C4***       2           3                 **1**
L6    C8          4           2                 2
L7    C9          5           4                 3
L8    C4***       6           3                 **1**

*because rows 1 and 2 both have the value C1, they would be assigned to the same final grouping

**because row 2 has the LOAN value L2, this means we can include row 4 in the consolidated final grouping. That row can be linked back to row 1 via the L2/C1 link

***finally, because row 4 includes the COLLATERAL value C4, this means we can include row 7 in the consolidated final grouping. That row can be linked back to row one via the L2/C4 & L2/C1 links

The data set is roughly 15m unique combinations of LOAN + COLLATERAL. The groups will likely crossover a few thousand (maybe +10 thousand) IDs in some edge cases. I've run into some resource issues on BQ testing some solutions (but those issues are mostly a do with my inexperience with BQ.) Just a heads up if that impacts anybody's recommendation.

Really appreciate your time, apologies for being overly vague/brief in my first version...

Mikhail Berlyant · Accepted Answer

Below is for BigQuery Standard SQL

As Gordon mentioned in comments - BigQuery doesn't have native support for recursive CTEs or hierarchical queries, so this cannot be done with just a single query!

BUT ..., this can be implemented using recently introduced scripting as in example below

DECLARE rows_count, run_away_stop INT64 DEFAULT 0;

CREATE TEMP TABLE input AS (
  SELECT 'L1' loan, 'C1' collateral UNION ALL
  SELECT 'L2', 'C1' UNION ALL
  SELECT 'L5', 'C8' UNION ALL
  SELECT 'L2', 'C4' UNION ALL
  SELECT 'L6', 'C8' UNION ALL
  SELECT 'L7', 'C9' UNION ALL
  SELECT 'L8', 'C4'
);

CREATE TEMP TABLE initial_grouping AS 
SELECT ARRAY_AGG(collateral ORDER BY collateral) arr 
FROM input
GROUP BY loan;

LOOP
  SET rows_count = (SELECT COUNT(1) FROM initial_grouping);
  SET run_away_stop = run_away_stop + 1;

  CREATE OR REPLACE TEMP TABLE initial_grouping AS
  SELECT ANY_VALUE(arr) arr FROM (
    SELECT ARRAY(SELECT DISTINCT val FROM UNNEST(arr) val ORDER BY val) arr
    FROM (
      SELECT ANY_VALUE(arr1) arr1, ARRAY_CONCAT_AGG(arr) arr    
      FROM (
        SELECT t1.arr arr1, t2.arr arr2, ARRAY(SELECT DISTINCT val FROM UNNEST(ARRAY_CONCAT( t1.arr, t2.arr)) val ORDER BY val) arr 
        FROM initial_grouping t1, initial_grouping t2 
        WHERE (SELECT COUNT(1) FROM UNNEST(t1.arr) val JOIN UNNEST(t2.arr) val USING(val)) > 0
      ) GROUP BY FORMAT('%t', arr1)
    )
  ) GROUP BY FORMAT('%t', arr);

  IF (rows_count = (SELECT COUNT(1) FROM initial_grouping) AND run_away_stop > 1) OR run_away_stop > 10 THEN BREAK; END IF;
END LOOP;

SELECT loan, collateral, final_grouping FROM input 
JOIN (SELECT ROW_NUMBER() OVER() final_grouping, arr FROM initial_grouping) 
ON collateral IN UNNEST(arr) 
ORDER BY loan, collateral;

Above script produces below result (which I believe is exactly what you are looking for)

Row loan    collateral  final_grouping   
1   L1      C1          1    
2   L2      C1          1    
3   L2      C4          1    
4   L5      C8          3    
5   L6      C8          3    
6   L7      C9          2    
7   L8      C4          1

Please note: when applying to real data - make sure you set appropriate max for run_away_stop (in above script it is 10 - see last statement within LOOP - you might need to increase it to make sure conversion will complete)

Finally: to apply to your real table:

1 - remove CREATE TEMP TABLE input (...) statement
2 - replace input with your_project.your_dataset.your_table in CREATE TEMP TABLE initial_grouping AS ... statement

How to consolidate two id columns, identifying which rows belong to same set of related IDs

Answers (2)

Related Questions