Reputation: 7261
I am working on a SQL query that will count the appearances of certain words in "long text", or a huge text field that is a CLOB
data type.
My dataset (which is massive, ~5M+ rows) looks something like this:
http://sqlfiddle.com/#!4/2c13d/1
I have a query, like this:
SELECT
TheTask AS Tasking,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%LONG%' THEN 1 ELSE 0 END) AS LongCount,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%TEXT%' THEN 1 ELSE 0 END) AS TextCount,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%ENGLISH%' THEN 1 ELSE 0 END) AS EnglishCount
FROM
example
GROUP BY
TheTask
However, it takes an extremely long time to run on the complete dataset (~3 hours or so). I believe this is due to LIKE optimization issues, but I am unsure of how else to achieve this goal dataset. I have tried researching other articles on how to optimize like, but is it possible that REGEX
or something would be quicker? I am looking to optimize this query by evaluating LIKE
performance.
Upvotes: 0
Views: 95
Reputation: 65288
The CONTEXT
index type is used to index long texts. You can use :
CREATE INDEX idx_TheTaskTxt ON example(TRIM(UPPER(TheTaskText))) INDEXTYPE IS CTXSYS.CONTEXT;
and collect statistics for the optimizer to take effect :
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXAMPLE', cascade=>TRUE);
and call
SELECT
TheTask AS Tasking,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) AS LongCount,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) AS TextCount,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) AS EnglishCount
FROM example
GROUP BY TheTask
HAVING
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END)
IN (0,1)
Upvotes: 1