Reputation: 159

REGEXP_SUBSTR to split comma separated values into rows not working

I have a table structure that contains data in column in comma separated values. I want to split these values into rows.

The data in table is something like this

MANUFACTURER_PART_NUMBER|MANUFACTURER_NAME        |TEST_PLAN           |ATTACHED_SUPPLIER_DOCUMENT |ATTACHED_LOGITECH_REPORT|
------------------------|-------------------------|--------------------|---------------------------|------------------------|
001059-0000             |CHENGDA                  |268452,268453       |268456,268457,268459,268460|268465                  |
001059-0000             |SHANGHAI MARRISON CO.,LTD|                    |268458,268462              |                        |
001059-0000             |SUZHOU SHARETECH         |                    |                           |                        |
001059-0000             |SYSTRON                  |268451,268452,268453|268456,268457,268459,268460|268465,268466           |

I have tried this query but it is not behaving well not sure the reason

WITH CTE AS(
SELECT DISTINCT 
MP.PART_NUMBER MANUFACTURER_PART_NUMBER, M.NAME MANUFACTURER_NAME, 
RTRIM(LTRIM(PG3.MULTILIST31, ','), ',') TEST_PLAN,
RTRIM(LTRIM(PG3.MULTILIST32, ','), ',') ATTACHED_SUPPLIER_DOCUMENT,
RTRIM(LTRIM(PG3.MULTILIST33, ','), ',') ATTACHED_LOGITECH_REPORT
FROM MANU_PARTS MP
INNER JOIN MANUFACTURERS M ON M.ID = MP.MANU_ID 
LEFT JOIN PAGE_TWO PG2 ON PG2.ID = MP.ID
LEFT JOIN PAGE_THREE PG3 ON PG3.ID = MP.ID

WHERE PART_NUMBER = '001059-0000'
)

SELECT DISTINCT 
MANUFACTURER_PART_NUMBER, MANUFACTURER_NAME, 
REGEXP_SUBSTR(TEST_PLAN, '[^,]+', 1, LEVEL) TEST_PLAN, 
REGEXP_SUBSTR(ATTACHED_SUPPLIER_DOCUMENT, '[^,]+', 1, LEVEL) ATTACHED_SUPPLIER_DOCUMENT, 
REGEXP_SUBSTR(ATTACHED_LOGITECH_REPORT, '[^,]+', 1, LEVEL) ATTACHED_LOGITECH_REPORT
FROM CTE 
CONNECT BY REGEXP_SUBSTR(NVL(NVL(TEST_PLAN, ATTACHED_SUPPLIER_DOCUMENT), ATTACHED_LOGITECH_REPORT), '[^,]+', 1, LEVEL) IS NOT NULL
ORDER BY MANUFACTURER_NAME

It should be giving the following output which is required too

MANUFACTURER_PART_NUMBER|MANUFACTURER_NAME        |TEST_PLAN|ATTACHED_SUPPLIER_DOCUMENT|ATTACHED_LOGITECH_REPORT|
------------------------|-------------------------|---------|--------------------------|------------------------|
001059-0000             |CHENGDA                  |268452   |268456                    |268465                  |
001059-0000             |CHENGDA                  |268453   |268457                    |                        |
001059-0000             |CHENGDA                  |         |268459                    |                        |
001059-0000             |CHENGDA                  |         |268460                    |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268458                    |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268462                    |                        |
001059-0000             |SUZHOU SHARETECH         |         |                          |                        |
001059-0000             |SYSTRON                  |268451   |268456                    |268465                  |
001059-0000             |SYSTRON                  |268452   |268457                    |268466                  |
001059-0000             |SYSTRON                  |268453   |268459                    |                        |
001059-0000             |SYSTRON                  |         |268460                    |                        |

But instead it is giving this out put and missing 2 rows or values for 1st CHENGDA and last 1 row of SYSTRON

MANUFACTURER_PART_NUMBER|MANUFACTURER_NAME        |TEST_PLAN|ATTACHED_SUPPLIER_DOCUMENT|ATTACHED_LOGITECH_REPORT|
------------------------|-------------------------|---------|--------------------------|------------------------|
001059-0000             |CHENGDA                  |268452   |268456                    |268465                  |
001059-0000             |CHENGDA                  |268453   |268457                    |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268458                    |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268462                    |                        |
001059-0000             |SUZHOU SHARETECH         |         |                          |                        |
001059-0000             |SYSTRON                  |268451   |268456                    |268465                  |
001059-0000             |SYSTRON                  |268452   |268457                    |268466                  |
001059-0000             |SYSTRON                  |268453   |268459                    |                        |

The reason which I found out is when I swap the position of Test Plan and Attached Supplier Document in Connect By clause which probably is because of number of values in both columns. But this should no be happening because what if Test Plan had more values in any case ?

Hope someone could help.

(Before marking this question duplicate please do ask if it already has an answer or not because that query did not work)

Upvotes: 1

Answers (2)

Muhammad Asim

Reputation: 159

WITH CTE AS(
SELECT DISTINCT 
MP.PART_NUMBER MANUFACTURER_PART_NUMBER, M.NAME MANUFACTURER_NAME, 
RTRIM(LTRIM(PG3.MULTILIST31, ','), ',') TEST_PLAN, 
RTRIM(LTRIM(PG3.MULTILIST32, ','), ',') ATTACHED_SUPPLIER_DOCUMENT, 
RTRIM(LTRIM(PG3.MULTILIST33, ','), ',') ATTACHED_LOGITECH_REPORT  
FROM MANU_PARTS MP
INNER JOIN MANUFACTURERS M ON M.ID = MP.MANU_ID 
LEFT JOIN PAGE_TWO PG2 ON PG2.ID = MP.ID
LEFT JOIN PAGE_THREE PG3 ON PG3.ID = MP.ID
WHERE PART_NUMBER = '001059-0000'
)
SELECT MANUFACTURER_PART_NUMBER, MANUFACTURER_NAME, 
REGEXP_SUBSTR(TEST_PLAN, '[^,]+', 1, column_value) TEST_PLAN, 
REGEXP_SUBSTR(ATTACHED_SUPPLIER_DOCUMENT, '[^,]+', 1, column_value) ATTACHED_SUPPLIER_DOCUMENT,
REGEXP_SUBSTR(ATTACHED_LOGITECH_REPORT, '[^,]+', 1, column_value) ATTACHED_LOGITECH_REPORT

FROM CTE 
CROSS JOIN TABLE(CAST(MULTISET(SELECT LEVEL FROM DUAL
CONNECT BY LEVEL <= GREATEST(REGEXP_COUNT(NVL(ATTACHED_LOGITECH_REPORT, 0), ','), REGEXP_COUNT(NVL(TEST_PLAN, 0), ','), REGEXP_COUNT(NVL(ATTACHED_SUPPLIER_DOCUMENT, 0), ','))+1 
) AS sys.odcinumberlist))

I have managed to get the desired result by checking the maximum count of the column and allocating that to each item.

Here is the output.

MANUFACTURER_PART_NUMBER|MANUFACTURER_NAME        |TEST_PLAN|ATTACHED_SUPPLIER_DOCUMENT|ATTACHED_LOGITECH_REPORT|
------------------------|-------------------------|---------|--------------------------|------------------------|
001059-0000             |CHENGDA                  |268452   |268456                    |268465                  |
001059-0000             |CHENGDA                  |268453   |268457                    |                        |
001059-0000             |CHENGDA                  |         |268459                    |                        |
001059-0000             |CHENGDA                  |         |268460                    |                        |
001059-0000             |SYSTRON                  |268451   |268456                    |268465                  |
001059-0000             |SYSTRON                  |268452   |268457                    |268466                  |
001059-0000             |SYSTRON                  |268453   |268459                    |                        |
001059-0000             |SYSTRON                  |         |268460                    |                        |
001059-0000             |SUZHOU SHARETECH         |         |                          |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268458                    |                        |
001059-0000             |SHANGHAI MARRISON CO.,LTD|         |268462                    |                        |

Upvotes: 0

Alex Poole

Reputation: 191275

As pointed out in comments, your data model isn't ideal, and you shouldn't be storing lists of values in strings. Assuming you're stuck with that, your problem is that the lists have different numbers of elements, so the depth can be different for each, leading to different limits on level; and trying to guess which to base it on is going to be a problem, as you've already realised.

You could possibly try to figure out which value to use for each ID using regexp_count and greatest but that's going to be tricky with multiple rows to deal with as well. (The distinct is an indication that you already have an issue there...)

Another approach is to split up each list individually and then try to associate matching elements - though that seems fairly arbitrary in itself. This seems to get the result you want though:

select mp.part_number,
  m.name,
  t.value31,
  t.value32,
  t.value33
from manu_parts mp
join manufacturers m
on m.id = mp.manu_id
left join (
  select coalesce(t31.id, t32.id, t33.id) as id,
    coalesce(t31.lvl, t32.lvl, t33.lvl) as lvl,
    t31.value as value31,
    t32.value as value32,
    t33.value as value33
  from 
  (
    select id, level as lvl,
      regexp_substr(multilist31, '(.*?)(,|$)', 1, level, null, 1) as value
    from page_three
    connect by id = prior id
    and prior dbms_random.value is not null
    and level < regexp_count(multilist31, '(.*?)(,|$)')
  ) t31
  full outer join (
    select id, level as lvl,
      regexp_substr(multilist32, '(.*?)(,|$)', 1, level, null, 1) as value
    from page_three
    connect by id = prior id
    and prior dbms_random.value is not null
    and level < regexp_count(multilist32, '(.*?)(,|$)')
  ) t32 
  on t32.id = t31.id and t32.lvl = t31.lvl
  full outer join (
    select id, level as lvl,
      regexp_substr(multilist33, '(.*?)(,|$)', 1, level, null, 1) as value
    from page_three
    connect by id = prior id
    and prior dbms_random.value is not null
    and level < regexp_count(multilist33, '(.*?)(,|$)')
  ) t33
  on (t33.id = t32.id and t33.lvl = t32.lvl)
  or (t33.id = t31.id and t33.lvl = t31.lvl)
) t on t.id = mp.id
where part_number = '001059-0000'
order by part_number, name, lvl;

PART_NUMBER NAME                      VALUE31    VALUE32    VALUE33   
----------- ------------------------- ---------- ---------- ----------
001059-0000 CHENGDA                   268452     268456     268465    
001059-0000 CHENGDA                   268453     268457               
001059-0000 CHENGDA                              268459               
001059-0000 CHENGDA                              268460               
001059-0000 SHANGHAI MARRISON CO.,LTD            268458               
001059-0000 SHANGHAI MARRISON CO.,LTD            268462               
001059-0000 SUZHOU SHARETECH                                          
001059-0000 SYSTRON                   268451     268456     268465    
001059-0000 SYSTRON                   268452     268457     268466    
001059-0000 SYSTRON                   268453     268459               
001059-0000 SYSTRON                              268460               

11 rows selected.

Each subquery like:

  (
    select id, level as lvl,
      regexp_substr(multilist31, '(.*?)(,|$)', 1, level, null, 1) as value
    from page_three
    connect by id = prior id
    and prior dbms_random.value is not null
    and level < regexp_count(multilist31, '(.*?)(,|$)')
  ) t31

splits a single column from that table, preserving the ID to join against later, and the level to associate with the other columns for that ID. The three of those are then outer joined together; and the result of that is joined to the main table.

The performance might be horrible, but if you're filtering on a single part number then that might be pushed into the subqueries so it doesn't do more work than needed.

You could also look at recursive subquery factoring instead of hierarchical queries, or possibly correlated subqueries with collections, but they will all have their own issues. That's what happens when you have to work around a bad data model.

Upvotes: 1

REGEXP_SUBSTR to split comma separated values into rows not working

Answers (2)

Related Questions