Reputation: 41
I am trying to count words of text that is written in a column of table. Therefor I am using the following query.
SELECT LEN(ExtractedText) -
LEN(REPLACE(ExtractedText, ' ', '')) + 1 from EDDSDBO.Document where ID='100'.
I receive a wrong result that is much to high. On the other hand, if I copy the text directly into the statement then it works, i.e.
SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1.
Now the datatype is nvarchar(max)
since the text is very long. I have already tried to convert the column into text
or ntext
and to apply datalength()
instead of len()
. Nevertheless I obtain the same result that it does work as a string but does not work from a table.
Upvotes: 4
Views: 3582
Reputation: 969
You should declare the column using the varchar
data type, like:
create table emp(ename varchar(22));
insert into emp values('amit');
select ename,len(ename) from emp;
output : 4
Upvotes: 0
Reputation: 44316
Replace the spaces with something that never occur in your text like ' $!' or pick another value. then replace all '$! ' and '$!' with nothing this way you never have more than 1 space after a word. Then use your current script. I have defined a word as a space followed by a non-space.
This is an example
DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)
INSERT @T VALUES('A B C D', 100)
SELECT LEN(C) - LEN(REPLACE(C,' ', '')) COUNT FROM (
SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ',''), '$!', '') C
FROM @T ) A
Here is a recursive solution
DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)
INSERT @T VALUES('A B C D', 100)
INSERT @T VALUES('have a nice day with 7 words', 100)
;WITH CTE AS
(
SELECT 1 words, col1 c, col1 FROM @t WHERE id = 100
UNION ALL
SELECT words +1, right(c, len(c) - patindex('% [^ ]%', c)), col1 FROM cte
WHERE patindex('% [^ ]%', c) > 0
)
SELECT words, col1 FROM cte WHERE patindex('% [^ ]%', c) = 0
Upvotes: 1
Reputation: 77657
Leading spaces, trailing spaces, two or more spaces between the neighbouring words – these are the likely causes of the wrong results you are getting.
The functions LTRIM()
and RTRIM()
can help you eliminate the first two issues. As for the third one, you can use REPLACE(ExtractedText, ' ', ' ')
to replace double spaces with single ones, but I'm not sure if you do not have triple ones (in which case you'd need to repeat the replacing).
UPDATE
Here's a UDF that uses CTEs and ranking to eliminate extra spaces and then counts the remaining ones to return the quantity as the number of words:
CREATE FUNCTION fnCountWords (@Str varchar(max))
RETURNS int
AS BEGIN
DECLARE @xml xml, @res int;
SET @Str = RTRIM(LTRIM(@Str));
WITH split AS (
SELECT
idx = number,
chr = SUBSTRING(@Str, number, 1)
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND LEN(@Str)
),
ranked AS (
SELECT
idx,
chr,
rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
FROM split
)
SELECT @res = COUNT(DISTINCT rnk) + 1
FROM ranked
WHERE chr = ' ';
RETURN @res;
END
With this function your query will be simply like this:
SELECT fnCountWords(ExtractedText)
FROM EDDSDBO.Document
WHERE ID='100'
UPDATE 2
The function uses one of the system tables, master..spt_values
, as a tally table. The particular subset used contains only values from 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming both leading and trailing spaces), as @t-clausen.dk has correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.
Upvotes: 1
Reputation: 23972
You're counting spaces not words. That will typically yield an approximate answer.
e.g.
' this string will give an incorrect result '
Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx
CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) )
RETURNS INT
AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @PrevChar CHAR(1)
DECLARE @WordCount INT
SET @Index = 1
SET @WordCount = 0
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
ELSE SUBSTRING(@InputString, @Index - 1, 1)
END
IF @PrevChar = ' ' AND @Char != ' '
SET @WordCount = @WordCount + 1
SET @Index = @Index + 1
END
RETURN @WordCount
END
GO
usage
DECLARE @String VARCHAR(4000)
SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'
SELECT [dbo].[WordCount] ( @String )
Upvotes: 2