Koushur
Koushur

Reputation: 301

Extracting and Joining Variable-Length Account Numbers from Unstructured Strings using Snowflake

Problem: I have two tables in Snowflake: a dimension table with account information and a fact table where the account numbers are embedded as part of a string in a PARAM_VALUE column of type string. These strings can have various formats and delimiters, or no separators at all, which makes it challenging to extract account numbers for joining.

I want to extract and flatten the account numbers from the PARAM_VALUE column in the fact table, and then join them with the dimension table to display corresponding account details. However, I don’t want to rely on any specific separator or formatting in PARAM_VALUE.

The dimension table contains account details:

Dimension Table (account_dim)

Acc_no
8G4F2H1J9K
4R6T8E3A1L
9W5E2C7R4T
1Q8N6F4G3H
6M9R5E2C1A
3K9P7R4G6T
5E1C8R6T2H
2T4R8G6F1N
7H3G5E2C9R
1L6T8R4E3A
9R5E2C7K3G
4F1N6H8R2T
8G6T4R3E1C
2H5E1C9R7K
6T4R3G6F1N
1C8R6T2H4G
3R5E2C9K7H
5T4R6G1F2E
9K3G5E2C1R
7R4G6T2H1L
Apex Industries
Nova Tech Solutions
Global Finance Corp
Eastern Logistics
Western Digital Inc
Cybernetic Systems
Green Earth Energy
Atlantic Healthcare
Pacific Software
Northern Lights Media
Southern Comfort Foods
Mountain View Ventures
Valley Forge Investments
Oceanic Exploration
Sunset Properties
Cityview Financial
Riverside Manufacturing
Skyline Advertising
Brookside Realty
Lakeside Construction

The fact table has billions of rows and looks like this:

Fact Table (fact_table)

SQL_ID OBJ_NAME Param_Value
sql_id890890830492hjkhfdjkd899058390245 customer_table ('TN190423','TN80989073') ,8G4F2H1J9K &niveiui
sql_id234567890123456789012345678901 orders_table 788686786767%9W5E2C7R4T jhjhjgjhg
sql_id987654321098765432109876543210 products_table ('5T4R6G1F2E','4R6T8E3A1L') AC9879879
sql_id456789012345678901234567890123 shipping_table (' Apex Industries
sql_id111111111111111111111111111111 inventory_table ' 3K9P7R4G6T,4F1N6H8R2T' TN0008989
sql_id222222222222222222222222222222 shipping_table ('9W5E2C7R4T','1Q8N6F4G3H') Eastern Logistics, Riverside Manufacturing jkjkhkj
sql_id333333333333333333333333333333 returns_table '6M9R5E2C1A,3K9P7R4G6T' ' 4R6T8E3A1L,'
sql_id444444444444444444444444444444 customers_table ('5E1C8R6T2H','2T4R8G6F1N') [ 6M9R5E2C1A , 2H5E1C9R7K ]
sql_id555555555555555555555555555555 orders_table '7H3G5E2C9R,1L6T8R4E3A'
sql_id666666666666666666666666666666 products_table ('3K9P7R4G6T','7897')
sql_id777777777777777777777777777777 inventory_table '9R5E2C7K3G,4F1N6H8R2T' &&&8G6T4R3E1C AC8098908,AC465465
sql_id678901234567890123456789012345 customers_table Cityview Financial;; Oceanic Exploration
sql_id789012345678901234567890123456 orders_table TN00098F1NBrookside Realty'
sql_id890123456789012345678901234567 products_table ('6uyiyiuuitug',' 6T4R3G6F1N ') Apex Industries

Expected Output: I need to extract patterns from PARAM_VALUE column in the fact_table having loolup at dim table account_dim and flatten them across fact table rowsand producing a new table like this:

SQL_ID OBJ_NAME searched pattern
sql_id890890830492hjkhfdjkd899058390245 customer_table 8G4F2H1J9K
sql_id234567890123456789012345678901 orders_table 9W5E2C7R4T
sql_id987654321098765432109876543210 products_table 5T4R6G1F2E
sql_id987654321098765432109876543210 products_table 4R6T8E3A1L
sql_id456789012345678901234567890123 shipping_table Apex Industries
sql_id456789012345678901234567890123 shipping_table 9K3G5E2C1R
sql_id111111111111111111111111111111 inventory_table 4F1N6H8R2T
sql_id111111111111111111111111111111 inventory_table 3K9P7R4G6T
sql_id222222222222222222222222222222 shipping_table 9W5E2C7R4T
sql_id222222222222222222222222222222 shipping_table 1Q8N6F4G3H
sql_id222222222222222222222222222222 shipping_table Eastern Logistics
sql_id222222222222222222222222222222 shipping_table Riverside Manufacturing
sql_id333333333333333333333333333333 returns_table 6M9R5E2C1A
sql_id333333333333333333333333333333 returns_table 3K9P7R4G6T
sql_id333333333333333333333333333333 returns_table 4R6T8E3A1L
sql_id444444444444444444444444444444 customers_table 5E1C8R6T2H
sql_id444444444444444444444444444444 customers_table 2T4R8G6F1N
sql_id444444444444444444444444444444 customers_table 6M9R5E2C1A
sql_id444444444444444444444444444444 customers_table 2H5E1C9R7K
sql_id555555555555555555555555555555 orders_table 7H3G5E2C9R
sql_id555555555555555555555555555555 orders_table 1L6T8R4E3A
sql_id666666666666666666666666666666 products_table 3K9P7R4G6T
sql_id777777777777777777777777777777 inventory_table 9R5E2C7K3G
sql_id777777777777777777777777777777 inventory_table 4F1N6H8R2T
sql_id777777777777777777777777777777 inventory_table 8G6T4R3E1C
sql_id678901234567890123456789012345 customers_table Cityview Financial
sql_id678901234567890123456789012345 customers_table Oceanic Exploration
sql_id789012345678901234567890123456 orders_table Brookside Realty
sql_id890123456789012345678901234567 products_table Apex Industries
sql_id890123456789012345678901234567 products_table 6T4R3G6F1N

Account can be of varying length and can have both account number and account name and can adapt any format. So there should not be split done as it can be seperated or not seperated and can use any seperator in between. SO focus is on searach from dim table.

I tried using the following REGEXP_LIKE query to match and extract the account numbers. However, this query does not return any results. The irregular formatting of the param_value column seems to be causing issues, and REGEXP_LIKE is not finding matches reliably.

WITH flattened AS (
  SELECT 
    f.sql_id,
    f.obj_name,
    a.Acc_No,
    a.Acc_name
  FROM 
    fact_table f
  JOIN 
    account_dim a ON REGEXP_LIKE(UPPER(f.param_value), UPPER(a.Acc_No))
)
SELECT 
  sql_id,
  obj_name,
  Acc_No,
  Acc_name
FROM 
  flattened;

Upvotes: 0

Views: 27

Answers (0)

Related Questions