Reputation: 35
There is a column in a table which can store up to 4000 characters. So for a given row, we need to write a query to display count of corresponding each distinct word in the sentence.
For e.g. the column has "Jack and Jill went up a hill. Jack came tumbling down"
Output :
<Word> - <Count>
Jack - 2
Jill - 1
hill - 1
and - 1
a - 1
came - 1 ... and so on
Upvotes: 0
Views: 674
Reputation: 60462
Since you tagged Teradata
you can use STRTOK_SPLIT_TO_TABLE
for the tokenizing part. Just add more characters to the separators list:
with cte as
(select
1 as keycol,
'Jack and Jill went up a hill. Jack came tumbling down' as col)
select keycol, token, count(*) as cnt
FROM TABLE (STRTOK_SPLIT_TO_TABLE(cte.keycol, cte.col,
' .,;:-?!()''"') -- list of separators
RETURNS (keycol INTEGER,
tokennum INTEGER,
token VARCHAR(100) CHARACTER SET UNICODE)
) AS d
group by 1,2
order by 1, cnt desc
But counting words might be much more complicated, as it usually includes tokenizing, stemming and stop words.
Upvotes: 1
Reputation: 17920
First , convert the words into rows and then group it.
In this query, we use a basic concept of row generation using CONNECT BY
.
For Example:
select level from dual CONNECT BY level <= 10;
The above query would generate 10 rows.(Hierarchical Level query).
Based on this simple logic, now we have to count the number of spaces here, and generate that many rows.REGEXP_COUNT(str,'[^ ]+')
would give the number of spaces in the sentence.
And using the level, extract a word from the sentence in each row. REGEXP_SUBSTR(str,'[^ ]+',1,level)
would do this.
You can play around with this query to handle other scenarios. Good Luck.
with tokenised_rows(str) as(
SELECT REGEXP_SUBSTR('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+',1,LEVEL)
FROM dual
CONNECT BY level <= REGEXP_COUNT('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+')
)
select str,count(1) from tokenised_rows
group by str;
Upvotes: 0