S.Krishna
S.Krishna

Reputation: 876

Replacing Text which does not match a pattern in Oracle

I have below text in a CLOB in a table

Table Name: tbl1
Columns
col1 - number (Primary Key)
col2 - clob (as below)

Row#1
-----
Col1 = 1
Col2 =
1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...

Row#2
-----
Col1 = 2
Col2 =
1331882981,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...


Now I need to know how we can get below output

Col1 Value
---- ---------------------
1 1331882981,ab123456
1 1331890329,pqr123223
2 1331882981,abc333
2 1331890329,pqrs23


([0-9]{10},[a-z 0-9]+.), ==> This is the regular expression to match "1331890329,pqrs23" and I need to know how can replace which are not matching this regex and then split them into multiple rows

EDIT#1
I am on Oracle 10.2.0.5.0 and hence cannot use REGEXP_COUNT function :-( Also, the col2 is a CLOB which is massive

EDIT#2
I've tried below query and it works fine for some records (i.e. if I add a "where" clause). But when I remove the "where", it never returns any result. I've tried to put this into a view and insert into a table and left it run overnight but still it had not completed :(

with t as (select col1, col2 from temp_table)
select col1,
       cast(substr(regexp_substr(col2, '[^~]+', 1, level), 1, 50) as
            varchar2(50)) data
  from t
connect by level <= length(col2) - length(replace(col2, '~')) + 1



EDIT#3

# of Chars in Clob      Total
-----------             -----
0 - 1k                  3196
1k - 5k                 2865
5k - 25k                 661
25k - 100k                36
> 100k                     2
-----------             -----
Grand Total             6760

I have ~7k rows of clobs which have the distribution as shown above...

Upvotes: 1

Views: 2921

Answers (2)

S.Krishna
S.Krishna

Reputation: 876

The above solutions didn't work and below is what I did.

update temp_table set col2=regexp_replace(col2,'([0-9]{10},[a-z0-9]+)','(\1)') ;
update temp_table set col2=regexp_replace(col2,'\),[\s\S]*~\(','(\1)$');
update temp_table set col2=regexp_replace(col2,'\).*?\(','$');
update temp_table set col2=replace(regexp_replace(col2,'\).*',''),'(','');

After these 4 update commands, the col2 will have something like

1 1331882981,ab123456$1331890329,pqr123223
2 1331882981,abc333$1331890329,pqrs23

Then I wrote a function to split this thing. The reason I went for the function is to split by "$" and the fact that the col2 still has >10k characters

create or replace function parse( p_clob in clob ) return sys.odciVarchar2List
pipelined
as
        l_offset number := 1;
        l_clob   clob := translate( p_clob, chr(13)|| chr(10) || chr(9), '   ' ) || '$';
        l_hit    number;
begin
        loop
          --Find occurance of "$" from l_offset
          l_hit := instr( l_clob, '$', l_offset );
          exit when nvl(l_hit,0) = 0;
          --Extract string from l_offset to l_hit
          pipe row ( substr(l_clob, l_offset , (l_hit - l_offset)) );
          --Move offset
          l_offset := l_hit+1;
        end loop;
end;

I then called

select col1,
       REGEXP_SUBSTR(column_value, '[^,]+', 1, 1) col3,
       REGEXP_SUBSTR(column_value, '[^,]+', 1, 2) col4
  from temp_table, table(parse(temp_table.col2));

Upvotes: 0

Emmanuel
Emmanuel

Reputation: 14209

Well, you could try something like:

with v as
(
  select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
  union all
  select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= REGEXP_COUNT(col2, '([0-9]{10},[a-z0-9]+)')
order by col1
;

This gives:

1   1331882981,ab123456
1   1331890329,pqr123223
2   1331890329,pqrs23
2   3188298777,abc333

EDIT : for 10g, REGEXP_COUNT does not exist but you have workarounds. Here I replace the pattern found by something I hope I won't find in the text (here, XYZXYZ but you can choose something much more complex to be confident), do a diff with the same matching but replaced by the empty string, then divide by my pattern length (here, 6):

with v as
(
  select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
  union all
  select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
order by col1
;

EDIT 2 : CLOBs (and LOBs in general) and regexp don't seem to fit well together:

ORA-00932: inconsistent datatypes: expected - got CLOB

Converting the CLOG to a string (regexp_substr(to_char(col2), ...) seems to fix the issue.

EDIT 3 : CLOBs don't like distinct either, so converting split result to char in an embedded request and then using the distinct on the upper request succeeds !

select distinct col1, split from
(
    select col1, to_char(regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level)) split
    from temp_epn
    connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
    order by col1
);

Upvotes: 1

Related Questions