berl13
berl13

Reputation: 41

SQL: problem word count with len()

I am trying to count words of text that is written in a column of table. Therefor I am using the following query.

SELECT LEN(ExtractedText) - 
LEN(REPLACE(ExtractedText, ' ', '')) + 1 from EDDSDBO.Document where ID='100'.

I receive a wrong result that is much to high. On the other hand, if I copy the text directly into the statement then it works, i.e.

SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1.

Now the datatype is nvarchar(max) since the text is very long. I have already tried to convert the column into text or ntext and to apply datalength() instead of len(). Nevertheless I obtain the same result that it does work as a string but does not work from a table.

Upvotes: 4

Views: 3582

Answers (4)

Er. Mohammad Jahangeer
Er. Mohammad Jahangeer

Reputation: 969

You should declare the column using the varchar data type, like:

create table emp(ename varchar(22));

insert into emp values('amit');

select ename,len(ename) from emp;

output : 4

Upvotes: 0

t-clausen.dk
t-clausen.dk

Reputation: 44316

Replace the spaces with something that never occur in your text like ' $!' or pick another value. then replace all '$! ' and '$!' with nothing this way you never have more than 1 space after a word. Then use your current script. I have defined a word as a space followed by a non-space.

This is an example

DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

INSERT @T VALUES('A B  C   D', 100)

SELECT LEN(C) - LEN(REPLACE(C,' ', '')) COUNT FROM (
SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ',''), '$!', '') C
FROM @T ) A

Here is a recursive solution

DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

INSERT @T VALUES('A B  C   D', 100)
INSERT @T VALUES('have a nice day with 7 words', 100)

;WITH CTE AS
(
SELECT 1 words, col1 c, col1 FROM @t WHERE id = 100
UNION ALL
SELECT words +1, right(c, len(c) - patindex('% [^ ]%', c)), col1 FROM cte
WHERE patindex('% [^ ]%', c) > 0
)
SELECT words, col1 FROM cte WHERE patindex('% [^ ]%', c) = 0

Upvotes: 1

Andriy M
Andriy M

Reputation: 77657

Leading spaces, trailing spaces, two or more spaces between the neighbouring words – these are the likely causes of the wrong results you are getting.

The functions LTRIM() and RTRIM() can help you eliminate the first two issues. As for the third one, you can use REPLACE(ExtractedText, ' ', ' ') to replace double spaces with single ones, but I'm not sure if you do not have triple ones (in which case you'd need to repeat the replacing).


UPDATE

Here's a UDF that uses CTEs and ranking to eliminate extra spaces and then counts the remaining ones to return the quantity as the number of words:

CREATE FUNCTION fnCountWords (@Str varchar(max))
RETURNS int
AS BEGIN
  DECLARE @xml xml, @res int;
  SET @Str = RTRIM(LTRIM(@Str));
  WITH split AS (
    SELECT
      idx = number,
      chr = SUBSTRING(@Str, number, 1)
    FROM master..spt_values
    WHERE type = 'P'
      AND number BETWEEN 1 AND LEN(@Str)
  ),
  ranked AS (
    SELECT
      idx,
      chr,
        rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
      FROM split
  )
  SELECT @res = COUNT(DISTINCT rnk) + 1
  FROM ranked
  WHERE chr = ' ';
  RETURN @res;
END

With this function your query will be simply like this:

SELECT fnCountWords(ExtractedText)
FROM EDDSDBO.Document
WHERE ID='100'

UPDATE 2

The function uses one of the system tables, master..spt_values, as a tally table. The particular subset used contains only values from 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming both leading and trailing spaces), as @t-clausen.dk has correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.

Upvotes: 1

Code Magician
Code Magician

Reputation: 23972

You're counting spaces not words. That will typically yield an approximate answer.

e.g.

'  this     string    will    give an    incorrect      result   '

Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx

CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) ) 
RETURNS INT
AS
BEGIN

DECLARE @Index          INT
DECLARE @Char           CHAR(1)
DECLARE @PrevChar       CHAR(1)
DECLARE @WordCount      INT

SET @Index = 1
SET @WordCount = 0

WHILE @Index <= LEN(@InputString)
BEGIN
    SET @Char     = SUBSTRING(@InputString, @Index, 1)
    SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
                         ELSE SUBSTRING(@InputString, @Index - 1, 1)
                    END

    IF @PrevChar = ' ' AND @Char != ' '
        SET @WordCount = @WordCount + 1

    SET @Index = @Index + 1
END

RETURN @WordCount

END
GO

usage

DECLARE @String VARCHAR(4000)
SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'

SELECT [dbo].[WordCount] ( @String )

Upvotes: 2

Related Questions