Reputation: 2839
This question has been asked more than a few times, but I can't find the specific answer I need. I have a query that finds the most commonly appearing words in a column in SQL Server and lists them with the count of their appearances. The problem is that if a word appears multiple times in a row, it counts once for each appearance. I would like to only count each word once per row.
So a row with a value of "To be or not to be" would count 'to' and 'be' once each, not twice each for purposes of overall frequency.
Here is the current query, which also strips out common words such as pronouns and replaces all of the commonly occurring separators with spaces. It's a bit old so I suspect it could be a lot neater.
SELECT sep.Col Phrase, count(*) as Qty
FROM (
Select * FROM (
Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), ' ', ' '), ' ', ' '))))
FROM Table
) easyValues
Where value <> ''
) actualValues
Cross Apply dbo.SeparateValues(value, ' ') sep
WHERE sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
and LEN(sep.Col) > 2
GROUP By sep.Col
HAVING count(*) > 1
Appreciate any thoughts on a better way to do this while fixing the issue of repeat words.
Upvotes: 2
Views: 499
Reputation: 16137
As far as I can tell, the STRING_SPLIT function along with CROSS APPLY can give you what you want. You can split the string based on the space delimiter, select each word distinctly, then count in an outer query. I ommitted the part where you don't select specific words for brevity.
CREATE TABLE phrases(phrase NVARCHAR(MAX));
INSERT INTO phrases(phrase)VALUES(N'To be or not to be'),(N'this is not a phrase'),(N'And why is this not another one');
SELECT
w.value,
COUNT(*)
FROM
phrases AS p
CROSS APPLY (
SELECT DISTINCT
value
FROM
STRING_SPLIT(p.phrase,N' ')
) AS w
GROUP BY
w.value;
Upvotes: 1
Reputation: 32695
You just need to GROUP BY
twice.
First by sep.Col
and Table.ID
to remove duplicates in a row. Your table has some ID
column, right?
Second, just by sep.Col
to get the final count.
I have also rewritten your query using CTEs to make it readable. At least, for me it is more readable in this way.
WITH
easyValues
AS
(
Select
ID
,value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), ' ', ' '), ' ', ' '))))
FROM Table
)
,actualValues
AS
(
SELECT
ID
,Value
FROM easyValues
Where value <> ''
)
,SeparateValues
AS
(
SELECT
ID
,sep.Col
FROM
actualValues
Cross Apply dbo.SeparateValues(value, ' ') AS sep
WHERE
sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
and LEN(sep.Col) > 2
)
,UniqueValues
AS
(
SELECT
ID, Col
FROM
SeparateValues
GROUP BY
ID, Col
)
SELECT
Col AS Phrase
,count(*) as Qty
FROM UniqueValues
GROUP By Col
HAVING count(*) > 1
;
Upvotes: 2
Reputation: 16908
To achieve your requirement, you can use a FUNCTION to split a string into list of words by delimiter ' ' space. With the help of this function, you can then use some Dynamic SQL like cursor to get the final count.
First create the FUNCTION as- Source Of code: stackoverflow
CREATE FUNCTION dbo.splitstring ( @stringToSplit VARCHAR(MAX) )
RETURNS @returnList TABLE ([Word] [nvarchar] (500))
AS
BEGIN
DECLARE @name NVARCHAR(255)
DECLARE @pos INT
WHILE CHARINDEX(' ', @stringToSplit) > 0
BEGIN
SELECT @pos = CHARINDEX(' ', @stringToSplit)
SELECT @name = SUBSTRING(@stringToSplit, 1, @pos-1)
INSERT INTO @returnList
SELECT @name
SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
INSERT INTO @returnList
SELECT @stringToSplit
RETURN
END
Then use this CURSOR script to get your final output-
DECLARE @Value VARCHAR(MAX)
DECLARE @WordList TABLE
(
Word VARCHAR(200)
)
DECLARE db_cursor CURSOR
FOR
SELECT Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace
(Replace(Replace(Replace(Replace(Replace(Replace(Replace
(Replace(Replace(title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' ')
, '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), ' ', ' '), ' ', ' ')))) [Value]
FROM table
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @Value
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO @WordList
SELECT DISTINCT Word FROM [dbo].[splitstring](@Value)
WHERE Word NOT IN ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
AND LEN(Word) > 2
FETCH NEXT FROM db_cursor INTO @Value
END
CLOSE db_cursor
DEALLOCATE db_cursor
SELECT Word,COUNT(*)
FROM @WordList
GROUP BY Word
Upvotes: 1