Reputation: 41
I've got a rough table structure like this:
ID | Value | Project | Last update |
---|---|---|---|
1 | Hashtag# | testing | 123981010 |
2 | I like,it | admin | 123129319 |
We have a very complex mechanism which results in a new table which needs these 3 columns. (has about 68 mil entries) For reasons I don't want to go into, we can't use a join statement. (lots of legacy tables) Now my approach was to create a delimited string like (with a function):
Value#Project#Lastupdate
Here it doesn't matter what the delimiter is. But given that Value can have ANY value present in there,I can't just use a very complicated separator which MIGHT cover 99.999999 % of all cases but in 1 case out of 100 of millions it wont. So what I did was to escape all occurences of the separator in the string, and then separate them. Now in this example it would look like this:
Hashtag\##testing#123981010
So I escape with ''. Now this all works, but it is really slow. The select statement
Here is my code:
CREATE OR REPLACE FUNCTION unescape_internal( i_str IN VARCHAR2, i_idx IN INTEGER) RETURN VARCHAR2 IS
v_first VARCHAR2(4000);
v_second VARCHAR2(4000);
v_third VARCHAR2(4000);
v_ret VARCHAR2(4000);
v_curr_str VARCHAR2(4000);
v_curr_char VARCHAR2(1);
BEGIN
v_curr_str := '';
FOR i in 1..LENGTH(i_str)
LOOP
v_curr_char := substr(i_str, i, 1);
IF v_curr_char = '#' AND i = 1 THEN
IF v_first IS NULL THEN
v_first := ' ';
END IF;
ELSIF v_curr_char = '#' AND substr(i_str, i-1, 1) != '\' THEN
IF v_first IS NULL THEN
v_first := v_curr_str;
v_curr_str := '';
ELSIF v_second IS NULL THEN
v_second := v_curr_str;
v_curr_str := '';
END IF;
ELSE
v_curr_str := v_curr_str || v_curr_char;
END IF;
END LOOP;
IF v_third IS NULL THEN
v_third := v_curr_str;
v_curr_str := '';
END IF;
IF( i_idx = 1 ) THEN
v_ret := TRIM(NVL(v_first,''));
ELSIF( i_idx = 2 ) THEN
v_ret := TRIM(NVL(v_second,''));
ELSIF( i_idx = 3 ) THEN
v_ret := TRIM(NVL(v_third,''));
ELSE
v_ret := '';
END IF;
RETURN v_ret;
END unescape_internal;
The code I used for testing was like this:
declare
v_varchar1 VARCHAR2(4000);
v_varchar2 VARCHAR2(4000);
v_varchar3 VARCHAR2(4000);
BEGIN
FOR i in 1..1000000
LOOP
select unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 1), unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 2), unescape_internal('123123123123#1dasdyxcsd113\##test' || i, 3) into v_varchar1, v_varchar2, v_varchar3 from dual;
END LOOP;
END;
This takes about 40 seconds. Which in this dataset doesn't sound much, but given that I have very little data, and small column values, this will be much more of a problem. We tried using this function with the actual dataset, before the my changes, the sql took about 300 s to return 250 entries in the sqldeveloper. With my changes, we had to abort after 30 minutes of waiting. Using 68 mil entries as in the real environment, this would take about 45 minutes. The code is slow due to it processing the string multiple times. Since I need every column (we have 3) I pass an index of the column I want to get returned. This results in 3 full processings of the string. Unfortunately using DETERMINISTIC for the function doesn't work, since the index is changed everytime.
I tried using regex first, but unfortunately oracle (12g) doesn't support lookahead/lookbehind regex epressions and thus I am now out of ideas and in hope someone has some kind of idea.
Upvotes: 0
Views: 1538
Reputation: 167982
You can eliminate the context-switch to PL/SQL and do it all in the SQL context using:
SELECT REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 1 ), '\#', '#' ) AS value1,
REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 2 ), '\#', '#' ) AS value2,
REPLACE( REGEXP_SUBSTR( value, '(\\.|[^#])+', 1, 3 ), '\#', '#' ) AS value3
FROM table_name
The regular expression (\\.|[^#])+
will start from the left of the string and look for:
\\.
A backslash character (your escape character) followed by a wildcard character .
to match an escaped pair of characters; or[^#]
any non-hash character.That match is done one-or-more times (this is the +
) to match the entire term and will stop when it finds a non-escaped hash character and will return the substring consisting of that entire term of non-hash and escaped charaters.
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'Hashtag\##testing#123981010' FROM DUAL
Outputs:
VALUE1 | VALUE2 | VALUE3 :------- | :------ | :-------- Hashtag# | testing | 123981010
db<>fiddle here
Another PL/SQL solution is:
CREATE OR REPLACE FUNCTION unescape_internal2(
i_str IN VARCHAR2,
i_idx IN INTEGER
) RETURN VARCHAR2 DETERMINISTIC IS
v_idx PLS_INTEGER := 0;
v_escape_pos PLS_INTEGER;
v_hash_pos PLS_INTEGER;
v_start_pos PLS_INTEGER := 1;
BEGIN
IF i_str IS NULL OR i_idx < 1 THEN
RETURN NULL;
END IF;
LOOP
v_hash_pos := v_start_pos;
LOOP
v_escape_pos := INSTR( i_str, '\#', v_hash_pos );
v_hash_pos := INSTR( i_str, '#', v_hash_pos );
EXIT WHEN v_escape_pos = 0 OR v_hash_pos = 0 OR v_escape_pos > v_hash_pos;
v_hash_pos := v_escape_pos + 2;
END LOOP;
v_idx := v_idx + 1;
IF v_idx = i_idx THEN
IF v_hash_pos = 0 THEN
RETURN REPLACE( SUBSTR( i_str, v_start_pos ), '\#', '#' );
ELSE
RETURN REPLACE( SUBSTR( i_str, v_start_pos, v_hash_pos - v_start_pos ), '\#', '#' );
END IF;
ELSIF v_hash_pos = 0 THEN
RETURN NULL;
END IF;
v_start_pos := v_hash_pos + 1;
END LOOP;
END unescape_internal2;
/
However, it does not seem to be as fast as using a regular expression and avoiding the context-switch from SQL to PL/SQL in the first solution but you should profile all the options on your system and data.
db<>fiddle here
Upvotes: 1