Reputation: 73
For example if I have 1000 rows of data that has customer ID (e.g. 123) and their comments on our product (e.g. great product easy use)
How do I use Teradata (version 15) to do a word frequency count so that the output has two columns one with the word and the other with frequency e.g. (Great: 20, Product: 10)?
Thank you
Upvotes: 1
Views: 2680
Reputation: 50034
You could use strtok_split_to_table
to pull this off.
Something like the following:
SELECT d.token, SUM(d.outkey)
FROM TABLE (strtok_split_to_table(1, <yourtable>.<yourcommentsfield>, ' ')
RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d
GROUP BY 1
This will split each word in your comments field into individual records, then it counts the occurrence of each word. Just stick your own <yourtable>.<yourcommentsfield>
in there and you should be good to go.
More information on strtok_split_to_table: http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1145_111A/String_Ops_Funcs.084.242.html
Here is the SQL and results for a test on my system:
CREATE SET TABLE db.testcloud ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
customer VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
comments VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( customer );
INSERT INTO testcloud (1, 'This is a test comment');
INSERT INTO testcloud (2, 'This is also comment of something');
SELECT d.token, SUM(d.outkey)
FROM TABLE (TD_SYSFNLIB.strtok_split_to_table(1, testcloud.comments, ' -/')
RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d
GROUP BY 1
--token Sum(outkey)
--is 2
--also 1
--This 2
--of 1
--test 1
--a 1
--comment 2
--something 1
Upvotes: 3