Derek
Derek

Reputation: 51

TSQL Fuzzy Grouping

Let's say I have rows with descriptions as pasted below:

DAS0_CIP.TRA_HVAC01_073.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.
DAS0_CIP.TRA_HVAC01_072.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.

These are VERY similar but not quite the same. Unfortunately, Soundex won't work to group them and I don't think I can use a two record compare function because I have a table with 7 million records that I need to group by the description. I want similar descriptions to group together. Is there a way to score or rank the similarity of rows? How would I group these together in a better manner?

Right now, I group by the left([description],50) or something like that to get a rough grouping, but it's not really that accurate. I'm figuring there must be a better way.

Extended data example:

[GREDS2UA1381TFT][MR_Data_2]MRSaveData Script:  Sproc failed to Update
!ERR Invalid hpItemArray in AlmLinkedList::AddItemAtEnd
!ERR Invalid hpItemArray in AlmLinkedList::GetItemPtrAt
!ERR Semaphore wait failed in CBThrd_CallbackThreadMain [hCBThreadSemaphore=000001AC]
!WNAL_Register(AlarmPriority10b) - failed(501)
!WNALderegister AlarmMgr not running
!WNAPalarm AlarmMgr not running
.Initialize(Label Details ToteStacker Manual): EntID or AreaName is bad: EntID= 0| AreaName= 
.RefreshGrid(Reprint): No production labels found with filters
.RefreshMaterialsGrid [PRODUCE]:  - Fact.Common.XMLReadBuilder
{1a} MES_CheeseMakeVatL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 16 ) failed, hr = 0x8000FFFF
{1a} MES_CheeseMakeVatL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 3 ) failed, hr = 0x80070057
[GRE_CM_REC_OS3][MR_Data_2]MRSaveData Script:  objResult Not Successful Proxy failed to connect to Middleware Server service endpoint. This can happen if your Middleware Server is not available or if the service endpoint portqueue is not accessible. The endpoint is [net.tcp:GRECMESDB1:8099FactMES.Server.HostEventBrokerSvc].
{1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 8 ) failed, hr = 0x80070057
{1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 9 ) failed, hr = 0x8000FFFF
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2542, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2560, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2560, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 3782]
[GREDS2UA1381TFT][MR_Data_2]MRSaveData Script:  Unable to get session instance
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 3651]
{1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 10 ) failed, hr = 0x80070057
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:35:36.037, 20160313 03:35:36.036) [aahMDASSF.cpp; 3315; 2380]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:39:50.371, 20160313 03:39:50.371) [aahMDASSF.cpp; 3315; 1]
{1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 3 ) failed, hr = 0x80070057
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:52:29.264, 20160313 03:52:29.264) [aahMDASSF.cpp; 3315; 3598]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:56:41.972, 20160313 03:56:41.971) [aahMDASSF.cpp; 3315; 776]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2588, 20160313 03:44:03.194, 20160313 03:44:03.194) [aahMDASSF.cpp; 3315; 1]
{1a} MES_SweetWheyL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 7 ) failed, hr = 0x8000FFFF
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2588, 20160313 03:44:03.194, 20160313 03:44:03.194) [aahMDASSF.cpp; 3315; 3199]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:27:07.759, 20160313 03:27:07.758) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2542, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 3472]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:35:36.037, 20160313 03:35:36.036) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:27:07.759, 20160313 03:27:07.758) [aahMDASSF.cpp; 3315; 332]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:31:20.290, 20160313 03:31:20.290) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:35:35.019, 20160313 03:35:35.018) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:35:35.019, 20160313 03:35:35.018) [aahMDASSF.cpp; 3315; 233]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:39:50.959, 20160313 03:39:50.959) [aahMDASSF.cpp; 3315; 253]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:44:04.936, 20160313 03:44:04.936) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:44:04.936, 20160313 03:44:04.936) [aahMDASSF.cpp; 3315; 289]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:52:30.967, 20160313 03:52:30.967) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:39:50.371, 20160313 03:39:50.371) [aahMDASSF.cpp; 3315; 2725]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:52:30.967, 20160313 03:52:30.967) [aahMDASSF.cpp; 3315; 326]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:48:17.312, 20160313 03:48:17.312) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:48:17.312, 20160313 03:48:17.312) [aahMDASSF.cpp; 3315; 3683]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:52:29.264, 20160313 03:52:29.264) [aahMDASSF.cpp; 3315; 1]
0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:56:41.972, 20160313 03:56:41.971) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:22:55.228, 20160313 03:22:55.227) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:22:55.228, 20160313 03:22:55.227) [aahMDASSF.cpp; 3315; 331]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:31:20.290, 20160313 03:31:20.290) [aahMDASSF.cpp; 3315; 325]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:39:50.959, 20160313 03:39:50.959) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:48:18.899, 20160313 03:48:18.899) [aahMDASSF.cpp; 3315; 1]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:48:18.899, 20160313 03:48:18.899) [aahMDASSF.cpp; 3315; 316]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:56:44.589, 20160313 03:56:44.588) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2100, 20160313 03:35:40.381, 20160313 03:35:40.380) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2101, 20160313 03:22:59.993, 20160313 03:22:59.993) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2101, 20160313 03:22:59.993, 20160313 03:22:59.993) [aahMDASSF.cpp; 3315; 1097]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:27:13.618, 20160313 03:27:13.618) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:27:13.618, 20160313 03:27:13.618) [aahMDASSF.cpp; 3315; 1092]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:31:27.118, 20160313 03:31:27.118) [aahMDASSF.cpp; 3315; 992]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:39:53.131, 20160313 03:39:53.131) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:39:53.131, 20160313 03:39:53.131) [aahMDASSF.cpp; 3315; 798]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:44:07.930, 20160313 03:44:07.930) [aahMDASSF.cpp; 3315; 1008]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:48:21.578, 20160313 03:48:21.577) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 419]
0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:56:44.589, 20160313 03:56:44.588) [aahMDASSF.cpp; 3315; 66]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2100, 20160313 03:35:40.381, 20160313 03:35:40.380) [aahMDASSF.cpp; 3315; 725]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:35:37.598, 20160313 03:35:37.597) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:31:27.118, 20160313 03:31:27.118) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:44:07.930, 20160313 03:44:07.930) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:48:21.578, 20160313 03:48:21.577) [aahMDASSF.cpp; 3315; 1143]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:52:37.035, 20160313 03:52:37.034) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:52:37.035, 20160313 03:52:37.034) [aahMDASSF.cpp; 3315; 1161]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:39:53.494, 20160313 03:39:53.493) [aahMDASSF.cpp; 3315; 306]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:56:50.535, 20160313 03:56:50.535) [aahMDASSF.cpp; 3315; 1]
0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:56:50.535, 20160313 03:56:50.535) [aahMDASSF.cpp; 3315; 183]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:44:05.698, 20160313 03:44:05.698) [aahMDASSF.cpp; 3315; 355]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:48:18.453, 20160313 03:48:18.452) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:48:18.453, 20160313 03:48:18.452) [aahMDASSF.cpp; 3315; 390]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 416]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 393]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:56:48.585, 20160313 03:56:48.584) [aahMDASSF.cpp; 3315; 71]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:35:37.598, 20160313 03:35:37.597) [aahMDASSF.cpp; 3315; 291]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:39:53.494, 20160313 03:39:53.493) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:44:05.698, 20160313 03:44:05.698) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2288, 20160313 03:42:55.241, 20160313 03:42:55.240) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:52:32.720, 20160313 03:52:32.719) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2288, 20160313 03:42:55.241, 20160313 03:42:55.240) [aahMDASSF.cpp; 3315; 639]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2294, 20160313 03:38:43.218, 20160313 03:38:43.218) [aahMDASSF.cpp; 3315; 442]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:52:32.720, 20160313 03:52:32.719) [aahMDASSF.cpp; 3315; 401]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:56:48.585, 20160313 03:56:48.584) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2338, 20160313 03:55:38.184, 20160313 03:55:38.184) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2294, 20160313 03:38:43.218, 20160313 03:38:43.218) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2336, 20160313 03:51:24.094, 20160313 03:51:24.094) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2344, 20160313 03:34:28.353, 20160313 03:34:28.352) [aahMDASSF.cpp; 3315; 455]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2344, 20160313 03:59:50.704, 20160313 03:59:50.703) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2336, 20160313 03:51:24.094, 20160313 03:51:24.094) [aahMDASSF.cpp; 3315; 841]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2338, 20160313 03:55:38.184, 20160313 03:55:38.184) [aahMDASSF.cpp; 3315; 670]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2343, 20160313 03:47:09.092, 20160313 03:47:09.091) [aahMDASSF.cpp; 3315; 1]
0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2343, 20160313 03:47:09.092, 20160313 03:47:09.091) [aahMDASSF.cpp; 3315; 658]
A write to InTouch was skipped with a value of  because the data quality was bad or initializing.
A write to InTouch was skipped with a value of CM__Ctrl because the data quality was bad or initializing.
A write to InTouch was skipped with a value of false because the data quality was bad or initializing.

Upvotes: 2

Views: 615

Answers (1)

Gottfried Lesigang
Gottfried Lesigang

Reputation: 67311

One approach might be this: It will test, how many parts are identical... You might continue with this idea into each part splitting this by _.

CREATE FUNCTION dbo.TestSimilarity(@s1 VARCHAR(100),@s2 VARCHAR(100))
RETURNS INT
AS
BEGIN
    DECLARE @x1 XML=CAST('<x>' + REPLACE(@s1,'.','</x><x>') + '</x>' AS XML);
    DECLARE @x2 XML=CAST('<x>' + REPLACE(@s2,'.','</x><x>') + '</x>' AS XML);

    DECLARE @retVal INT=0;
    IF @x1.value('/x[1]','varchar(100)')= @x2.value('/x[1]','varchar(100)') SET @retVal=@retVal+1;
    IF @x1.value('/x[2]','varchar(100)')= @x2.value('/x[2]','varchar(100)') SET @retVal=@retVal+1;
    IF @x1.value('/x[3]','varchar(100)')= @x2.value('/x[3]','varchar(100)') SET @retVal=@retVal+1;
    IF @x1.value('/x[4]','varchar(100)')= @x2.value('/x[4]','varchar(100)') SET @retVal=@retVal+1;
    IF @x1.value('/x[5]','varchar(100)')= @x2.value('/x[5]','varchar(100)') SET @retVal=@retVal+1;

    RETURN @retVal;
END;
GO

SELECT dbo.TestSimilarity('DAS0_CIP.TRA_HVAC01_073.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.','DAS0_CIP.TRA_HVAC01_072.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.');
GO

DROP FUNCTION dbo.TestSimilarity;

The result would be 4...

Upvotes: 1

Related Questions