naiveBayes
naiveBayes

Reputation: 35

query to display count of corresponding each distinct word

There is a column in a table which can store up to 4000 characters. So for a given row, we need to write a query to display count of corresponding each distinct word in the sentence.

For e.g. the column has "Jack and Jill went up a hill. Jack came tumbling down"

Output :  
<Word> - <Count> 
Jack - 2 
Jill - 1
hill - 1
and - 1
a - 1
came - 1 ... and so on

Upvotes: 0

Views: 674

Answers (2)

dnoeth
dnoeth

Reputation: 60462

Since you tagged Teradata you can use STRTOK_SPLIT_TO_TABLE for the tokenizing part. Just add more characters to the separators list:

with cte as
(select
    1 as keycol,
    'Jack and Jill went up a hill. Jack came tumbling down' as col)
select keycol, token, count(*) as cnt
FROM TABLE (STRTOK_SPLIT_TO_TABLE(cte.keycol, cte.col, 
            ' .,;:-?!()''"') -- list of separators
     RETURNS (keycol INTEGER,
              tokennum INTEGER,
              token VARCHAR(100) CHARACTER SET UNICODE)
           ) AS d 
group by 1,2
order by 1, cnt desc

But counting words might be much more complicated, as it usually includes tokenizing, stemming and stop words.

Upvotes: 1

Maheswaran Ravisankar
Maheswaran Ravisankar

Reputation: 17920

First , convert the words into rows and then group it.

In this query, we use a basic concept of row generation using CONNECT BY.

For Example:

select level from dual CONNECT BY level <= 10;

The above query would generate 10 rows.(Hierarchical Level query).

Based on this simple logic, now we have to count the number of spaces here, and generate that many rows.REGEXP_COUNT(str,'[^ ]+') would give the number of spaces in the sentence.

And using the level, extract a word from the sentence in each row. REGEXP_SUBSTR(str,'[^ ]+',1,level) would do this.

You can play around with this query to handle other scenarios. Good Luck.

with tokenised_rows(str) as(
SELECT  REGEXP_SUBSTR('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+',1,LEVEL) 
  FROM dual
CONNECT BY level <= REGEXP_COUNT('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+')
)
select str,count(1) from tokenised_rows
group by str;

Upvotes: 0

Related Questions