Semen Yefimenko
Semen Yefimenko

Reputation: 41

Oracle SQL splitting strings with delimiter in column

I've got a rough table structure like this:

ID Value Project Last update
1 Hashtag# testing 123981010
2 I like,it admin 123129319

We have a very complex mechanism which results in a new table which needs these 3 columns. (has about 68 mil entries) For reasons I don't want to go into, we can't use a join statement. (lots of legacy tables) Now my approach was to create a delimited string like (with a function):

Value#Project#Lastupdate

Here it doesn't matter what the delimiter is. But given that Value can have ANY value present in there,I can't just use a very complicated separator which MIGHT cover 99.999999 % of all cases but in 1 case out of 100 of millions it wont. So what I did was to escape all occurences of the separator in the string, and then separate them. Now in this example it would look like this:

Hashtag\##testing#123981010

So I escape with ''. Now this all works, but it is really slow. The select statement

Here is my code:

CREATE OR REPLACE FUNCTION unescape_internal( i_str IN VARCHAR2, i_idx IN INTEGER) RETURN VARCHAR2 IS
    v_first VARCHAR2(4000);
    v_second VARCHAR2(4000);
    v_third VARCHAR2(4000);
    v_ret VARCHAR2(4000);
    v_curr_str VARCHAR2(4000);
    v_curr_char VARCHAR2(1);
BEGIN
    v_curr_str := '';
    
    FOR i in 1..LENGTH(i_str)
    LOOP
        v_curr_char := substr(i_str, i, 1);
        IF v_curr_char = '#' AND i = 1 THEN
            IF v_first IS NULL THEN
                v_first := ' ';
            END IF;
        ELSIF v_curr_char = '#' AND substr(i_str, i-1, 1) != '\' THEN
            IF v_first IS NULL THEN
                v_first := v_curr_str;
                v_curr_str := '';
            ELSIF v_second IS NULL THEN
                v_second := v_curr_str;
                v_curr_str := '';
            END IF;
        ELSE
            v_curr_str := v_curr_str || v_curr_char;
        END IF;     
    END LOOP;
    
    IF v_third IS NULL THEN
        v_third := v_curr_str;
        v_curr_str := '';
    END IF;
    
    IF( i_idx = 1 ) THEN
        v_ret := TRIM(NVL(v_first,''));
    ELSIF( i_idx = 2 ) THEN
        v_ret := TRIM(NVL(v_second,''));
    ELSIF( i_idx = 3 ) THEN
        v_ret := TRIM(NVL(v_third,''));
    ELSE 
        v_ret := '';
    END IF;
    
    RETURN v_ret;
END unescape_internal;

The code I used for testing was like this:

    declare
    v_varchar1 VARCHAR2(4000);
    v_varchar2 VARCHAR2(4000);
    v_varchar3 VARCHAR2(4000);
BEGIN
    FOR i in 1..1000000 
    LOOP
        select unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 1), unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 2), unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 3)  into v_varchar1, v_varchar2, v_varchar3 from dual;
    END LOOP;
END;

This takes about 40 seconds. Which in this dataset doesn't sound much, but given that I have very little data, and small column values, this will be much more of a problem. We tried using this function with the actual dataset, before the my changes, the sql took about 300 s to return 250 entries in the sqldeveloper. With my changes, we had to abort after 30 minutes of waiting. Using 68 mil entries as in the real environment, this would take about 45 minutes. The code is slow due to it processing the string multiple times. Since I need every column (we have 3) I pass an index of the column I want to get returned. This results in 3 full processings of the string. Unfortunately using DETERMINISTIC for the function doesn't work, since the index is changed everytime.

I tried using regex first, but unfortunately oracle (12g) doesn't support lookahead/lookbehind regex epressions and thus I am now out of ideas and in hope someone has some kind of idea.

Upvotes: 0

Views: 1538

Answers (1)

MT0
MT0

Reputation: 167982

You can eliminate the context-switch to PL/SQL and do it all in the SQL context using:

SELECT REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 1 ), '\#', '#' ) AS value1,
       REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 2 ), '\#', '#' ) AS value2,
       REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 3 ), '\#', '#' ) AS value3
FROM   table_name

The regular expression (\\.|[^#])+ will start from the left of the string and look for:

  • \\. A backslash character (your escape character) followed by a wildcard character . to match an escaped pair of characters; or
  • [^#] any non-hash character.

That match is done one-or-more times (this is the +) to match the entire term and will stop when it finds a non-escaped hash character and will return the substring consisting of that entire term of non-hash and escaped charaters.

Which, for the sample data:

CREATE TABLE table_name ( value ) AS
SELECT 'Hashtag\##testing#123981010' FROM DUAL

Outputs:

VALUE1   | VALUE2  | VALUE3   
:------- | :------ | :--------
Hashtag# | testing | 123981010

db<>fiddle here


Update:

Another PL/SQL solution is:

CREATE OR REPLACE FUNCTION unescape_internal2(
  i_str IN VARCHAR2,
  i_idx IN INTEGER
) RETURN VARCHAR2 DETERMINISTIC IS
  v_idx        PLS_INTEGER := 0;
  v_escape_pos PLS_INTEGER;
  v_hash_pos   PLS_INTEGER;
  v_start_pos  PLS_INTEGER := 1;
BEGIN
  IF i_str IS NULL OR i_idx < 1 THEN
    RETURN NULL;
  END IF;
  
  LOOP
    v_hash_pos := v_start_pos;
    LOOP
      v_escape_pos := INSTR( i_str, '\#', v_hash_pos );
      v_hash_pos   := INSTR( i_str, '#', v_hash_pos );
      EXIT WHEN v_escape_pos = 0 OR v_hash_pos = 0 OR v_escape_pos > v_hash_pos;
      v_hash_pos := v_escape_pos + 2;
    END LOOP;
    
    v_idx := v_idx + 1;
    
    IF v_idx = i_idx THEN
      IF v_hash_pos = 0 THEN
        RETURN REPLACE( SUBSTR( i_str, v_start_pos ), '\#', '#' );
      ELSE
        RETURN REPLACE( SUBSTR( i_str, v_start_pos, v_hash_pos - v_start_pos ), '\#', '#' );
      END IF;
    ELSIF v_hash_pos = 0 THEN
      RETURN NULL;
    END IF;
    v_start_pos := v_hash_pos + 1;
  END LOOP;
END unescape_internal2;
/

However, it does not seem to be as fast as using a regular expression and avoiding the context-switch from SQL to PL/SQL in the first solution but you should profile all the options on your system and data.

db<>fiddle here

Upvotes: 1

Related Questions