Reputation: 4766
I am trying to use a regular expression to do an accent insensitive search on an accent sensitive column.
I cannot change the column to an accent insensitive collation because the column is unique and we need to allow both 'Jose' and 'José' to be entered.
I cannot coerce the collation within my query because the table holds a ton of data and coercing it will cause it to do an index scan instead of a seek and the query times out
I cannot add a new non-unique column that uses an accent insensitive collation. This would avoid the 2 issues above because the original column could be accent sensitive and unique and I would be able to do the search on the new column and use the index on it, however I'm being told I can't do this because it would cause data duplication.
So, I'm trying to do this using a regular expression search on the accent sensitive column. In my application I take in whatever string the user entered and I modify the string in such a way that for this input
"Jose"
I get the output
"J[òóôõöø][sš][eèéêë]%"
Using this string to search works. It finds "Jose" and "José" and it searches with the correct index and doesn't time out.
The only issue I have now is with the darn æ character.
If the database contains the values "aéro" & "æro" and the user enters "aero", currently my application will generate the search strings "[aàáâãäå][eèéêë]r[oòóôõöø]%"
But this will only be a match for "aéro". It will not match "æro". By default in sql server it already treats "ae" the same as "æ" whether the collation is accent sensitive or insensitive, so not modifying the user's input would return "æro" as a match, but would not return "aéro" as a match.
Does anyone have any ideas how I can do a regular expression search in SQL that will match both "aé" and "æ"?
SELECT *
FROM mytable
WHERE name LIKE '[aàáâãäå][eèéêë]%'
Upvotes: 2
Views: 151
Reputation: 1739
Unfortunately, SQL Server LIKE
statement does not support the full regex functionality. You can however achieve the result you are looking for by modifying your current query as follow:
DECLARE @primarySearch VARCHAR(100) = '[aàáâãäå][eèéêë]r[oòóôõöø]%';
DECLARE @secondarySearchKey VARCHAR(100) = REPLACE(@primarySearch, '[aàáâãäå][eèéêë]', 'æ')
SELECT *
FROM mytable
WHERE Name LIKE @primarySearch OR Name LIKE @secondarySearchKey
In the modified version we add a second OR comparison check that simply replaced [aàáâãäå][eèéêë] combination with the æ character. This will add some additional overhead to your query but I doubt it will be significant.
UPDATE:
The above solution will not work for multiple occurrence [aàáâãäå][eèéêë] combination as per Nick’s comments. You could however perform an initial wider search that only matches the pattern up to the first occurrence of the [aàáâãäå][eèéêë] character combination. Once the initial match is perform you can normalize the text by replacing the occurrence of the ‘æ’ character with ‘ae’ string. Once this is done you can perform the refined search as per following example:
DECLARE @primarySearch VARCHAR(100) = '[aàáâãäå][eèéêë]_[aàáâãäå][eèéêë]_[aàáâãäå][eèéêë]%';
IF (@primarySearch LIKE '%\[aàáâãäå\]\[eèéêë\]%' ESCAPE '\') OR @primarySearch LIKE '%æ%'
BEGIN
DECLARE @firstOccurrence INT = CHARINDEX('[aàáâãäå][eèéêë]', @primarySearch);
DECLARE @initialSearchKey VARCHAR(100) = LEFT(@primarySearch, @firstOccurrence + LEN('[aàáâãäå][eèéêë]') - 1) + '%'
DECLARE @initialSearchSpecialCaseKey VARCHAR(100) = REPLACE(@initialSearchKey, '[aàáâãäå][eèéêë]', 'æ');
SELECT @initialSearchKey AS IntialKey;
SELECT @initialSearchSpecialCaseKey AS SpecialCaseKey ;
WITH InitialSearch AS
(
SELECT Name
,REPLACE(Name, 'æ', 'ae') AS NormalizedName
FROM mytable
WHERE Name LIKE @initialSearchKey OR Name LIKE @initialSearchSpecialCaseKey
)
SELECT Name
FROM InitialSearch
WHERE NormalizedName LIKE @primarySearch
END
ELSE
BEGIN
SELECT Name
FROM mytable
WHERE Name LIKE @primarySearch
END
Upvotes: 2