Reputation: 194
original table contains json, but i've stripped it down to the table below:
id | json |
---|---|
1 | "name":"one.it.two" |
2 | "name": "one.it.two" |
difference between the two rows is the space after :
catalog has no stopwords.
searching for CONTAINS (json, 'it')
return both rows.
searching for CONTAINS (json, 'two')
return both rows.
searching for CONTAINS (json, 'one')
returns only the second row.
why does searching for one
not return the first row?
this is no more a json or delimited text issue.
id | text1 |
---|---|
1 | name:first.it |
2 | name: first.it |
difference between the two rows is the space after :
searching for first
does not return the first row.
search works if i change first.it
to first.and
results from sys.dm_fts_parser
is not consistent:
text | words |
---|---|
name:first.it | name:first.it name :first it |
name:first.and | name first.and first and |
name:first,it | name first it |
i used SELECT * FROM sys.dm_fts_parser ('"<text>"', 1033, NULL, 0)
Upvotes: 2
Views: 339
Reputation: 5165
Based on the info provided in this answer https://dba.stackexchange.com/a/65845/94130
it seems that .it
is treated as a special word (possibly a top level domain) by word breaker.
I can only infer that this "special word logic" has a b̶u̶g̶ feature in it, where :
is treated as part of the name. Examples:
SELECT * FROM sys.dm_fts_parser (' "name:first.it" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.net" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.com" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.gov" ', 1033, 0, 0);
Notice that it always returns one extra result. I assume it includes the extra line when it thinks that part of the string is a URL.
special_term | display_term | expansion_type | source_term |
---|---|---|---|
Exact Match | name:first.gov | 0 | name:first.gov |
Exact Match | name | 0 | name:first.gov |
Exact Match | :first | 0 | name:first.gov |
Exact Match | gov | 0 | name:first.gov |
Note that some words are not affected:
SELECT * FROM sys.dm_fts_parser (' "name:first.he" ', 1033, 0, 0);
I have modified code provided in https://dba.stackexchange.com/a/25848/94130 to get all characters that are treated this way.
declare @i integer
declare @cnt integer
set @i=0
while @i<255
begin
set @cnt=0
select @cnt=COUNT(1) FROM sys.dm_fts_parser ('"name'+REPLACE(CHAR(@i),'"','""')+'first.net"', 1033, 0, 0)
WHERE display_term = CHAR(@i) + 'first'
if @cnt=1
begin
print 'this char - '+CASE WHEN @i > 31 THEN char(@i) ELSE '' END+' - char('+convert(varchar(3),@i)+') is included'
end
set @i=@i+1
end
Output:
this char - - char(0) is included
this char - : - char(58) is included
this char - - char(173) is included
There is a Microsoft article explaining how to switch word breakers/stemmers, which may (or may not) solve this but I have not tried this.
Note: Above code was executed on Win 11 and SQL 2019 Dev
Upvotes: 0