artemis
artemis

Reputation: 7261

Optimization of text parsing - Oracle SQL

I am working on a SQL query that will count the appearances of certain words in "long text", or a huge text field that is a CLOB data type.

My dataset (which is massive, ~5M+ rows) looks something like this:

http://sqlfiddle.com/#!4/2c13d/1

I have a query, like this:

SELECT
  TheTask AS Tasking,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%LONG%' THEN 1 ELSE 0 END) AS LongCount,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%TEXT%' THEN 1 ELSE 0 END) AS TextCount,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%ENGLISH%' THEN 1 ELSE 0 END) AS EnglishCount
FROM
  example
GROUP BY
  TheTask

However, it takes an extremely long time to run on the complete dataset (~3 hours or so). I believe this is due to LIKE optimization issues, but I am unsure of how else to achieve this goal dataset. I have tried researching other articles on how to optimize like, but is it possible that REGEX or something would be quicker? I am looking to optimize this query by evaluating LIKE performance.

Upvotes: 0

Views: 95

Answers (1)

Barbaros Özhan
Barbaros Özhan

Reputation: 65288

The CONTEXT index type is used to index long texts. You can use :

CREATE INDEX idx_TheTaskTxt ON example(TRIM(UPPER(TheTaskText))) INDEXTYPE IS CTXSYS.CONTEXT;

and collect statistics for the optimizer to take effect :

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXAMPLE', cascade=>TRUE);

and call

SELECT
  TheTask AS Tasking,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) AS LongCount,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) AS TextCount,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) AS EnglishCount
FROM example
GROUP BY TheTask
HAVING 
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) * 
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) *
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) 
       IN (0,1)

Upvotes: 1

Related Questions