Reputation: 12829
I have a table that lists a freet text input from a survey where enterents were allowed to enter their responses (regarding colours they would like to have in their wedding)
I would like to write a sql function that gathers all the information from this column, and orders counts the frequency of each word, ordering the result set by this count.
Response
--------
Red and White
green
White and blue
Blue
Dark blue
I would like the above table to be ordered as follows
Response Frequency
-------- ---------
Blue 3
White 2
And 2
Red 1
Green 1
I can strip all the rubbish words like "and" after the function has run. Does anyone know any good functions that produce this behaviour?
Upvotes: 4
Views: 12318
Reputation: 10908
DECLARE @phrases TABLE (id int, phrase varchar(max))
INSERT @phrases values
(1,'Red and White' ),
(2,'green' ),
(3,'White and blue' ),
(4,'Blue' ),
(5,'Dark blue' );
SELECT word, COUNT(*) c
FROM @phrases
CROSS APPLY (SELECT CAST('<a>'+REPLACE(phrase,' ','</a><a>')+'</a>' AS xml) xml1 ) t1
CROSS APPLY (SELECT n.value('.','varchar(max)') AS word FROM xml1.nodes('a') x(n) ) t2
GROUP BY word
word freq ----------- ----------- and 2 blue 3 Dark 1 green 1 Red 1 White 2
Upvotes: 2
Reputation: 12829
Okay this works a treat. Firstly a function to separate the values...
Alter Function dbo.SeparateValues
(
@data VARCHAR(MAX),
@delimiter VARCHAR(10)
)
RETURNS
@tbldata TABLE(col VARCHAR(MAX))
As
--Declare @data VARCHAR(MAX) ,@delimiter VARCHAR(10)
--Declare @tbldata TABLE(col VARCHAR(10))
--Set @data = 'hello,how,are,you?,234234'
--Set @delimiter = ','
--DECLARE @tbl TABLE(col VARCHAR(10))
Begin
DECLARE @pos INT
DECLARE @prevpos INT
SET @pos = 1
SET @prevpos = 0
WHILE @pos > 0
BEGIN
SET @pos = CHARINDEX(@delimiter, @data, @prevpos+1)
if @pos > 0
INSERT INTO @tbldata(col) VALUES(LTRIM(RTRIM(SUBSTRING(@data, @prevpos+1, @pos-@prevpos-1))))
else
INSERT INTO @tbldata(col) VALUES(LTRIM(RTRIM(SUBSTRING(@data, @prevpos+1, len(@data)-@prevpos))))
SET @prevpos = @pos
End
RETURN
END
then I just apply it to my table...
Select Count(*), sep.Col FROM (
Select * FROM (
Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(response, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), ' ', ' '), ' ', ' ')))) FROM Responses
) easyValues
Where value <> ''
) actualValues
Cross Apply dbo.SeparateValues(value, ' ') sep
Group By sep.Col
Order By Count(*) Desc
Okay, so I went OTT with my nested tables, but I've stripped out all the crap characters, separated the values and kept a running total of the most frequently used words.
Upvotes: 4
Reputation: 44268
You're main problem is that you're missing a split function in SQL Server.
Theres a sample one here that looks pretty good..
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50648
Using that, you write a stored proc along the lines of...
CREATE TABLE #Temp (Response nvarchar(50), Frequency int)
DECLARE @response nvarchar(100)
DECLARE db_cursor CURSOR FOR
SELECT response FROM YourTable
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @response
WHILE @@FETCH_STATUS = 0
BEGIN
/* Pseudo Code */
--Split @Response
--Iterate through each word in returned list
--IF(EXISTS in #TEMP)
-- UPDATE THAT ROW & INCREMENT THE FREQUENCY
--ELSE
-- NEW WORD, INSERT TO #Temp WITH A FREQUENCY OF 1
FETCH NEXT FROM db_cursor INTO @response
END
SELECT * FROM #Temp
Theres probably a less fugly way to do this without cursors, but if it's just something you need to run once, and you're table or Responses isn't phenomenally large, then this should work
Upvotes: 1