Reputation: 35
I have 2 ID columns that are created/collected independently. I'm trying to consolidate these two ID columns into one by determining which rows are part of the same related group of ids based on either of the two ID columns. I would consider the rows to be related based on a few rules:
1: If a LOAN has the same value in multiple rows, they belong to the same group (in the example for reference only.) I've called it loan_group. No issues here.
2: If a COLLATERAL has the same value in multiple rows, they belong to the temporary group. I've called it collateral_group (same rule as #1.) No issues here.
3: Finally, and I'm not sure how to phrase this exactly, but any time there is overlap between values that are part of the same group (across loan and collateral columns), those groups should be further consolidated. For example:
LOAN COLLATERAL loan_group collateral_group final_grouping
---- ----------- ---------- ---------------- --------------
L1 C1* 1 1 **1**
L2** C1* 2 1 **1**
L5 C8 3 2 2
L2** C4*** 2 3 **1**
L6 C8 4 2 2
L7 C9 5 4 3
L8 C4*** 6 3 **1**
*because rows 1 and 2 both have the value C1, they would be assigned to the same final grouping
**because row 2 has the LOAN value L2, this means we can include row 4 in the consolidated final grouping. That row can be linked back to row 1 via the L2/C1 link
***finally, because row 4 includes the COLLATERAL value C4, this means we can include row 7 in the consolidated final grouping. That row can be linked back to row one via the L2/C4 & L2/C1 links
The data set is roughly 15m unique combinations of LOAN + COLLATERAL. The groups will likely crossover a few thousand (maybe +10 thousand) IDs in some edge cases. I've run into some resource issues on BQ testing some solutions (but those issues are mostly a do with my inexperience with BQ.) Just a heads up if that impacts anybody's recommendation.
Really appreciate your time, apologies for being overly vague/brief in my first version...
Upvotes: 1
Views: 314
Reputation: 173046
Below is for BigQuery Standard SQL
As Gordon mentioned in comments - BigQuery doesn't have native support for recursive CTEs or hierarchical queries, so this cannot be done with just a single query!
BUT ..., this can be implemented using recently introduced scripting as in example below
DECLARE rows_count, run_away_stop INT64 DEFAULT 0;
CREATE TEMP TABLE input AS (
SELECT 'L1' loan, 'C1' collateral UNION ALL
SELECT 'L2', 'C1' UNION ALL
SELECT 'L5', 'C8' UNION ALL
SELECT 'L2', 'C4' UNION ALL
SELECT 'L6', 'C8' UNION ALL
SELECT 'L7', 'C9' UNION ALL
SELECT 'L8', 'C4'
);
CREATE TEMP TABLE initial_grouping AS
SELECT ARRAY_AGG(collateral ORDER BY collateral) arr
FROM input
GROUP BY loan;
LOOP
SET rows_count = (SELECT COUNT(1) FROM initial_grouping);
SET run_away_stop = run_away_stop + 1;
CREATE OR REPLACE TEMP TABLE initial_grouping AS
SELECT ANY_VALUE(arr) arr FROM (
SELECT ARRAY(SELECT DISTINCT val FROM UNNEST(arr) val ORDER BY val) arr
FROM (
SELECT ANY_VALUE(arr1) arr1, ARRAY_CONCAT_AGG(arr) arr
FROM (
SELECT t1.arr arr1, t2.arr arr2, ARRAY(SELECT DISTINCT val FROM UNNEST(ARRAY_CONCAT( t1.arr, t2.arr)) val ORDER BY val) arr
FROM initial_grouping t1, initial_grouping t2
WHERE (SELECT COUNT(1) FROM UNNEST(t1.arr) val JOIN UNNEST(t2.arr) val USING(val)) > 0
) GROUP BY FORMAT('%t', arr1)
)
) GROUP BY FORMAT('%t', arr);
IF (rows_count = (SELECT COUNT(1) FROM initial_grouping) AND run_away_stop > 1) OR run_away_stop > 10 THEN BREAK; END IF;
END LOOP;
SELECT loan, collateral, final_grouping FROM input
JOIN (SELECT ROW_NUMBER() OVER() final_grouping, arr FROM initial_grouping)
ON collateral IN UNNEST(arr)
ORDER BY loan, collateral;
Above script produces below result (which I believe is exactly what you are looking for)
Row loan collateral final_grouping
1 L1 C1 1
2 L2 C1 1
3 L2 C4 1
4 L5 C8 3
5 L6 C8 3
6 L7 C9 2
7 L8 C4 1
Please note: when applying to real data - make sure you set appropriate max for run_away_stop (in above script it is 10 - see last statement within LOOP - you might need to increase it to make sure conversion will complete)
Finally: to apply to your real table:
1 - remove CREATE TEMP TABLE input (...)
statement
2 - replace input
with your_project.your_dataset.your_table
in CREATE TEMP TABLE initial_grouping AS ...
statement
Upvotes: 3
Reputation: 249
Your description is a bit vague, so hard to tell whether this is the asked for solution. Try to elaborate a bit more on exactly what you need regarding "belong to same 'path' of groups and then to perform allocation within group."
SELECT Loan, Collateral, COUNT(*) AS Group FROM [TABLE]
GROUP BY Loan, Collateral
You need to change the tableName [TABLE]
-- UPDATE
There might be a problem with your presentation, as the numbers don't add up.
I've tried to make a double grouping here, but there seems to be a problem around L2 and C8 which breaks the logic, as these two values behave differently.
SELECT * FROM [Tabel] AS A
LEFT JOIN
(
SELECT [Loan], COUNT(*) AS Rows_Loan FROM [Tabel]
GROUP BY [Loan]
) AS T
ON A.Loan = T.Loan
LEFT JOIN (
SELECT [Collateral], COUNT(*) AS Rows_Collateral FROM [Tabel]
GROUP BY [Collateral]
) AS T2
ON A.Collateral = T2.Collateral
Result:
Loan Collateral Loan Rows_Loan Collateral Rows_Collateral
L1 C1 L1 1 C1 2
L2 C1 L2 2 C1 2
L5 C8 L5 1 C8 2
L2 C4 L2 2 C4 1
L6 C8 L6 1 C8 2
Upvotes: 1