Woodchuck
Woodchuck

Reputation: 4434

Query to count distinct values in Oracle db CLOB column

I would like to query an Oracle DB table for the number of rows containing each distinct value in a CLOB column.

This returns all rows containing a value:

select * from mytable where dbms_lob.instr(mycol,'value') > 0;

Using DBMS_LOB, this returns the number of rows containing that value:

select count(*) from mytable where dbms_lob.instr(mycol,'value') > 0;

But is it possible to query for the number of times (rows in which) each distinct value appears?

Upvotes: 1

Views: 2032

Answers (3)

Pavel Smirnov
Pavel Smirnov

Reputation: 4799

Standard Oracle functions do not support distinction of CLOB values. But, if you have access to DBMS_CRYPTO.HASH function, you can compare CLOB hashes instead, and thus, get the desired output:

select myCol, h.num from
   myTable t join 
      (select min(rowid) rid, count(rowid) num
              from myTable
       where dbms_lob.instr(mycol,'value') > 0
       group by DBMS_CRYPTO.HASH(myCol, 3)) h
   on t.rowid = h.rid;

Also, note, that there's a very little possibility of hash collision. But if that's ok with you, you can use this approach.

Upvotes: 2

Alex Poole
Alex Poole

Reputation: 191415

If your CLOB values are more than 4000 bytes (and if not, why are they CLOBs?) then it's not perfect - collisions are possible, if unlikely - but you could hash the CLOB values.

If you want to count the number of distinct values:

select count(distinct dbms_crypto.hash(src=>mycol, typ=>2))
from mytable
where dbms_lob.instr(mycol,'value') > 0;

If you want to count how many times each distinct value appears:

select mycol, cnt
from (
  select mycol,
    count(*) over (partition by dbms_crypto.hash(src=>mycol, typ=>2)) as cnt,
    row_number() over (partition by dbms_crypto.hash(src=>mycol, typ=>2) order by null) as rn
  from mytable
  where dbms_lob.instr(mycol,'value') > 0
)
where rn = 1;

Both are likely to be fairly expensive and slow with a lot of data.

(typ=>2 gives the numeric value for dbms_crypto.hash_md5, as you can't refer to the package constant in a SQL call, at least up to 12cR1...)


Rather more crudely, but possibly significantly quicker, you could base the count on the just the first 4000 characters - which may or may not be plausible for your actual data:

select count(distinct dbms_lob.substr(mycol, 4000, 1))
from mytable
where dbms_lob.instr(mycol,'value') > 0;

select dbms_lob.substr(mycol, 4000, 1), count(*)
from mytable
where dbms_lob.instr(mycol,'value') > 0
group by dbms_lob.substr(mycol, 4000, 1);

Upvotes: 2

Littlefoot
Littlefoot

Reputation: 142968

Depending on what that column really contains, see whether TO_CHAR helps.

SQL> create table mytable (mycol clob);

Table created.

SQL> insert into mytable
  2    select 'Query to count distinct values' from dual union all
  3    select 'I have no idea which values are popular' from dual;

2 rows created.

SQL>  select count(*), to_char(mycol) toc
  2  from mytable
  3  where dbms_lob.instr(mycol,'value') > 0
  4  group by to_char(mycol);

  COUNT(*) TOC
---------- ----------------------------------------
         1 Query to count distinct values
         1 I have no idea which values are popular

SQL>

Upvotes: 2

Related Questions