Reputation: 79
I want to filter out several key phrases out of a dataset. Unfortunately, the only algorithm I've been able to come up with thus far is nested replace statements, such as:
SELECT
REPLACE(
REPLACE(FIELDNAME,'</u>','')
,'<u>','')
where FIELDNAME is raw HTML code stored in a table. As you can see, this is hideous. Any better ideas?
Upvotes: 1
Views: 347
Reputation: 56735
Here is a discussion of a similar but more complicated problem of "Cleaning" HTML text according to a list of rules (you need to register at this site, but that's it). Included in the discussion are several ways to do it with T-SQL and one way to do it, the fastest by far, with SQLCLR. As I wrote the VB.Net/SQLCLR solution, I am including it below.
Here are the text Replacements/Transforms that it implements:
The which implements a text transducer type of DFSA (Deterministic Finite State Automaton, well, it's almost deterministic, because it does look-ahead in a couple of places):
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
Public Enum States
Space1
Entity
HTMLTag
Norm
Word1
Script
Style
End Enum
Enum SubStates
None
EndBegin
EndSlash
End Enum
Const CharSpace As Integer = 32
Const CharAmp As Integer = 38
Const CharSlash As Integer = 47
Const CharLT As Integer = 60
Const CharGT As Integer = 62
Const CharA As Integer = 65
Const CharX As Integer = 88
Const CharZ As Integer = 90
Const Char_a As Integer = 97
Const Char_b As Integer = 98
Const Char_n As Integer = 110
Const Char_p As Integer = 112
Const Char_s As Integer = 115
Const Char_z As Integer = 122
Const CharDash As Integer = 45
Const CharSemiC As Integer = 59
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCleaner(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
Dim b As Byte
Dim i As Integer, j As Integer
Dim Out As Byte()
Dim State As States = States.Space1
Dim Substate As SubStates = SubStates.None
Dim strAccum As String = ""
ReDim Out(0 To chars.Length - 1)
For i = 0 To chars.Length - 1
b = chars(i)
Select Case State
Case States.Norm
Select Case b
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Norm
Case CharSpace
Out(j) = b
j = j + 1
State = States.Space1
Case CharAmp
State = States.Entity 'skip output'
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case Else
State = States.Norm 'skip output'
End Select
Case States.Space1
Select Case b
Case CharSpace
'discard leading & multiple spaces'
Case CharAmp
State = States.Entity 'skip output'
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Word1
Case Else
State = States.Norm 'skip output'
End Select
Case States.Word1
Select Case b
Case CharSpace
'single char word, retract from output:'
j = j - 1
State = States.Space1
Case CharAmp
State = States.Entity 'skip output'
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Norm
Case Else
State = States.Norm 'skip output'
End Select
Case States.Entity
Select Case b
Case CharSemiC
'End of entity, wrap it up:'
If strAccum = "nbsp" Then
Out(j) = CharSpace
j = j + 1
strAccum = ""
State = States.Space1
Else
'output "X"'
Out(j) = CharX
j = j + 1
State = States.Norm
End If
Case Else
'else, keep scanning for semicolon...'
' accumulate entity chars:'
strAccum = strAccum & b
End Select
Case States.HTMLTag
If b = CharGT Then
If strAccum = "SCRIPT" Then
strAccum = ""
State = States.Script
Substate = SubStates.None
ElseIf strAccum = "STYLE" Then
strAccum = ""
State = States.Style
Substate = SubStates.None
Else
Out(j) = CharSpace
j = j + 1
State = States.Space1
strAccum = ""
End If
Else
'accumulate tag name'
strAccum = strAccum & b
End If
Case States.Script
Select Case Substate
Case SubStates.None
If b = CharGT Then
Substate = SubStates.EndBegin
End If
Case SubStates.EndBegin
If b = CharSlash Then
Substate = SubStates.EndSlash
strAccum = ""
Else
Substate = SubStates.None
End If
Case SubStates.EndSlash
If b = CharGT Then
If strAccum = "SCRIPT" Then
'end of script found; output nothing'
State = States.Norm
Substate = SubStates.None
Else
'false alarm, back to script-scanning'
Substate = SubStates.None
End If
Else
'accumulate the end-tags label'
strAccum = strAccum & b
End If
End Select
Case States.Style
Select Case Substate
Case SubStates.None
If b = CharGT Then
Substate = SubStates.EndBegin
End If
Case SubStates.EndBegin
If b = CharSlash Then
Substate = SubStates.EndSlash
strAccum = ""
Else
Substate = SubStates.None
End If
Case SubStates.EndSlash
If b = CharGT Then
If strAccum = "STYLE" Then
'end of script found; output nothing'
State = States.Norm
Substate = SubStates.None
Else
'false alarm, back to script-scanning'
Substate = SubStates.None
End If
Else
'accumulate the end-tags label'
strAccum = strAccum & b
End If
End Select
Case Else
Debug.Assert(1 = 0)
End Select
'extra check for multiple spaces'
If j > 1 _
AndAlso (Out(j - 1) = CharSpace _
And Out(j - 2) = CharSpace) Then
j = j - 1 'roll back the last character'
ElseIf j = 1 AndAlso Out(0) = CharSpace Then
j = 0 'overwrite leading space'
End If
Next
'remove any trailing space:'
If j > 0 AndAlso Out(j - 1) = CharSpace Then j = j - 1
'trim off the trailing excess'
ReDim Preserve Out(0 To j - 1)
Return New SqlBytes(Out)
End Function
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCopy2(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
Dim out() As Byte
ReDim out(0 To chars.Length - 1)
For i As Integer = 0 To chars.Length - 1
out(i) = chars.Buffer(i)
Next
Return New SqlBytes(out)
End Function
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCopy(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
' Add your code here'
Return New SqlTypes.SqlBytes(chars.Buffer)
End Function
End Class
Upvotes: 1
Reputation: 294287
Such string manipulations are best handled by CLR scalar valued functions.
Upvotes: 1
Reputation: 23064
I don't think there's a better way in TSQL.
If you've got another environment on top of the SQL layer (e.g asp.net) you might have more luck doing the filtering in that.
Upvotes: 3