nycdan
nycdan

Reputation: 2839

SQL Server - Find frequency of occurance (by row, not word) of most common words in a column

This question has been asked more than a few times, but I can't find the specific answer I need. I have a query that finds the most commonly appearing words in a column in SQL Server and lists them with the count of their appearances. The problem is that if a word appears multiple times in a row, it counts once for each appearance. I would like to only count each word once per row.

So a row with a value of "To be or not to be" would count 'to' and 'be' once each, not twice each for purposes of overall frequency.

Here is the current query, which also strips out common words such as pronouns and replaces all of the commonly occurring separators with spaces. It's a bit old so I suspect it could be a lot neater.

    SELECT   sep.Col Phrase, count(*) as Qty
    FROM (
        Select * FROM (
            Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) 
            FROM Table
        ) easyValues
        Where value <> ''
        ) actualValues 
        Cross Apply dbo.SeparateValues(value, ' ') sep
    WHERE sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                        , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                        , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
          and LEN(sep.Col) > 2
    GROUP By sep.Col
    HAVING count(*) > 1

Appreciate any thoughts on a better way to do this while fixing the issue of repeat words.

Upvotes: 2

Views: 499

Answers (3)

TT.
TT.

Reputation: 16137

As far as I can tell, the STRING_SPLIT function along with CROSS APPLY can give you what you want. You can split the string based on the space delimiter, select each word distinctly, then count in an outer query. I ommitted the part where you don't select specific words for brevity.

Fiddle<>:

CREATE TABLE phrases(phrase NVARCHAR(MAX));

INSERT INTO phrases(phrase)VALUES(N'To be or not to be'),(N'this is not a phrase'),(N'And why is this not another one');

SELECT 
    w.value,
    COUNT(*) 
FROM 
    phrases AS p 
    CROSS APPLY (
        SELECT DISTINCT 
            value 
         FROM 
            STRING_SPLIT(p.phrase,N' ')
    ) AS w
GROUP BY 
    w.value;

Upvotes: 1

Vladimir Baranov
Vladimir Baranov

Reputation: 32695

You just need to GROUP BY twice.

First by sep.Col and Table.ID to remove duplicates in a row. Your table has some ID column, right?

Second, just by sep.Col to get the final count.

I have also rewritten your query using CTEs to make it readable. At least, for me it is more readable in this way.

WITH
easyValues
AS
(
    Select
        ID
        ,value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) 
    FROM Table
)
,actualValues
AS
(
    SELECT
        ID
        ,Value
    FROM easyValues
    Where value <> ''
)
,SeparateValues
AS
(
    SELECT
        ID
        ,sep.Col
    FROM
        actualValues
        Cross Apply dbo.SeparateValues(value, ' ') AS sep
    WHERE
        sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                        , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                        , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
        and LEN(sep.Col) > 2
)
,UniqueValues
AS
(
    SELECT
        ID, Col
    FROM
        SeparateValues
    GROUP BY
        ID, Col
)
SELECT
    Col AS Phrase
    ,count(*) as Qty
FROM UniqueValues
GROUP By Col
HAVING count(*) > 1
;

Upvotes: 2

mkRabbani
mkRabbani

Reputation: 16908

To achieve your requirement, you can use a FUNCTION to split a string into list of words by delimiter ' ' space. With the help of this function, you can then use some Dynamic SQL like cursor to get the final count.

First create the FUNCTION as- Source Of code: stackoverflow

CREATE  FUNCTION dbo.splitstring ( @stringToSplit VARCHAR(MAX) )
RETURNS @returnList TABLE ([Word] [nvarchar] (500))
AS
BEGIN
    DECLARE @name NVARCHAR(255)
    DECLARE @pos INT

    WHILE CHARINDEX(' ', @stringToSplit) > 0
    BEGIN
    SELECT @pos  = CHARINDEX(' ', @stringToSplit)  
    SELECT @name = SUBSTRING(@stringToSplit, 1, @pos-1)

    INSERT INTO @returnList 
    SELECT @name

    SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END

INSERT INTO @returnList
SELECT @stringToSplit

RETURN
END

Then use this CURSOR script to get your final output-

DECLARE @Value VARCHAR(MAX)
DECLARE @WordList TABLE
(
  Word VARCHAR(200)
)

DECLARE db_cursor CURSOR 
FOR 
SELECT Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace
                        (Replace(Replace(Replace(Replace(Replace(Replace(Replace
                        (Replace(Replace(title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' ')
                        , '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) [Value]
FROM table

OPEN db_cursor  
FETCH NEXT FROM db_cursor INTO @Value  

WHILE @@FETCH_STATUS = 0  
BEGIN  

    INSERT INTO @WordList
    SELECT DISTINCT Word FROM [dbo].[splitstring](@Value)
    WHERE Word NOT IN ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                    , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                    , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
    AND LEN(Word) > 2

    FETCH NEXT FROM db_cursor INTO @Value 
END 

CLOSE db_cursor  
DEALLOCATE db_cursor


SELECT Word,COUNT(*)
FROM @WordList
GROUP BY Word 

Upvotes: 1

Related Questions