hmad
hmad

Reputation: 194

issue with MSSQL fulltext search

original table contains json, but i've stripped it down to the table below:

id json
1 "name":"one.it.two"
2 "name": "one.it.two"

difference between the two rows is the space after :
catalog has no stopwords.
searching for CONTAINS (json, 'it') return both rows.
searching for CONTAINS (json, 'two') return both rows.
searching for CONTAINS (json, 'one') returns only the second row.

why does searching for one not return the first row?


i've reduced the test case even further. thanks to @RobinWebb

this is no more a json or delimited text issue.

id text1
1 name:first.it
2 name: first.it

difference between the two rows is the space after :

searching for first does not return the first row.

search works if i change first.it to first.and


thanks to @AlwaysLearning, this is an issue with the word breaker

results from sys.dm_fts_parser is not consistent:

text words
name:first.it name:first.it
name
:first
it
name:first.and name
first.and
first
and
name:first,it name
first
it

i used SELECT * FROM sys.dm_fts_parser ('"<text>"', 1033, NULL, 0)

Upvotes: 2

Views: 339

Answers (1)

Alex
Alex

Reputation: 5165

Based on the info provided in this answer https://dba.stackexchange.com/a/65845/94130 it seems that .it is treated as a special word (possibly a top level domain) by word breaker.

I can only infer that this "special word logic" has a b̶u̶g̶ feature in it, where : is treated as part of the name. Examples:

SELECT * FROM sys.dm_fts_parser (' "name:first.it" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.net" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.com" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.gov" ', 1033, 0, 0);

Notice that it always returns one extra result. I assume it includes the extra line when it thinks that part of the string is a URL.

special_term display_term expansion_type source_term
Exact Match name:first.gov 0 name:first.gov
Exact Match name 0 name:first.gov
Exact Match :first 0 name:first.gov
Exact Match gov 0 name:first.gov

Note that some words are not affected:

SELECT * FROM sys.dm_fts_parser (' "name:first.he" ', 1033, 0, 0);

I have modified code provided in https://dba.stackexchange.com/a/25848/94130 to get all characters that are treated this way.

declare @i integer
declare @cnt integer
set @i=0
while @i<255
begin
  set @cnt=0
  select @cnt=COUNT(1) FROM sys.dm_fts_parser ('"name'+REPLACE(CHAR(@i),'"','""')+'first.net"', 1033, 0, 0)
  WHERE display_term = CHAR(@i) + 'first'
  if @cnt=1
    begin
      print 'this char - '+CASE WHEN @i > 31 THEN char(@i) ELSE '' END+' - char('+convert(varchar(3),@i)+') is included'
    end
  set @i=@i+1
end 

Output:

this char -  - char(0) is included
this char - : - char(58) is included
this char - ­ - char(173) is included

There is a Microsoft article explaining how to switch word breakers/stemmers, which may (or may not) solve this but I have not tried this.

Note: Above code was executed on Win 11 and SQL 2019 Dev

Upvotes: 0

Related Questions