DaWolfman
DaWolfman

Reputation: 79

SQL Text Filtering

I want to filter out several key phrases out of a dataset. Unfortunately, the only algorithm I've been able to come up with thus far is nested replace statements, such as:

SELECT 
    REPLACE(
            REPLACE(FIELDNAME,'</u>','')
            ,'<u>','')

where FIELDNAME is raw HTML code stored in a table. As you can see, this is hideous. Any better ideas?

Upvotes: 1

Views: 347

Answers (3)

RBarryYoung
RBarryYoung

Reputation: 56735

Here is a discussion of a similar but more complicated problem of "Cleaning" HTML text according to a list of rules (you need to register at this site, but that's it). Included in the discussion are several ways to do it with T-SQL and one way to do it, the fastest by far, with SQLCLR. As I wrote the VB.Net/SQLCLR solution, I am including it below.

Here are the text Replacements/Transforms that it implements:

  1. Remove Script tags and contents
  2. Replace all HTML tags with spaces.
  3. Replace with a space
  4. Replace all entity codes (“&xxx;”) with X
  5. Replace all punctuation and math symbols (.,;:’”&()[]+/<>≥≤°÷) with a space (dashes are not replaced) Add percent, back slash, underscore, carat, asterisk, equal sign, curly braces, question and exclamation marks, pipes, dollar and cents signs, pound sign, tabs, crlf
  6. Replace all digits with a space
  7. Replace all one letter words with a space (pattern: space single-character-wildcard space)
  8. Remove redundant spaces.

The which implements a text transducer type of DFSA (Deterministic Finite State Automaton, well, it's almost deterministic, because it does look-ahead in a couple of places):

Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server

Partial Public Class UserDefinedFunctions

    Public Enum States
        Space1
        Entity
        HTMLTag
        Norm
        Word1
        Script
        Style
    End Enum
    Enum SubStates
        None
        EndBegin
        EndSlash
    End Enum
    Const CharSpace As Integer = 32
    Const CharAmp As Integer = 38
    Const CharSlash As Integer = 47
    Const CharLT As Integer = 60
    Const CharGT As Integer = 62
    Const CharA As Integer = 65
    Const CharX As Integer = 88
    Const CharZ As Integer = 90
    Const Char_a As Integer = 97
    Const Char_b As Integer = 98
    Const Char_n As Integer = 110
    Const Char_p As Integer = 112
    Const Char_s As Integer = 115
    Const Char_z As Integer = 122
    Const CharDash As Integer = 45
    Const CharSemiC As Integer = 59

    <Microsoft.SqlServer.Server.SqlFunction( _
            DataAccess:=DataAccessKind.None _
            , IsDeterministic:=True _
            , IsPrecise:=True)> _
    Public Shared Function HTMLCleaner(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
        Dim b As Byte
        Dim i As Integer, j As Integer
        Dim Out As Byte()
        Dim State As States = States.Space1
        Dim Substate As SubStates = SubStates.None
        Dim strAccum As String = ""

        ReDim Out(0 To chars.Length - 1)

        For i = 0 To chars.Length - 1
            b = chars(i)
            Select Case State

                Case States.Norm
                    Select Case b
                        Case CharA To CharZ, Char_a To Char_z
                            Out(j) = b
                            j = j + 1
                            State = States.Norm
                        Case CharSpace
                            Out(j) = b
                            j = j + 1
                            State = States.Space1
                        Case CharAmp
                            State = States.Entity 'skip output'
                        Case CharLT
                            Out(j) = CharSpace
                            j = j + 1
                            State = States.HTMLTag
                        Case CharDash
                            Out(j) = b
                            j = j + 1
                            State = States.Norm
                        Case Else
                            State = States.Norm 'skip output'
                    End Select

                Case States.Space1
                    Select Case b
                        Case CharSpace
                            'discard leading & multiple spaces'
                        Case CharAmp
                            State = States.Entity 'skip output'
                        Case CharLT
                            Out(j) = CharSpace
                            j = j + 1
                            State = States.HTMLTag
                        Case CharDash
                            Out(j) = b
                            j = j + 1
                            State = States.Norm
                        Case CharA To CharZ, Char_a To Char_z
                            Out(j) = b
                            j = j + 1
                            State = States.Word1
                        Case Else
                            State = States.Norm 'skip output'
                    End Select

                Case States.Word1
                    Select Case b
                        Case CharSpace
                            'single char word, retract from output:'
                            j = j - 1
                            State = States.Space1
                        Case CharAmp
                            State = States.Entity 'skip output'
                        Case CharLT
                            Out(j) = CharSpace
                            j = j + 1
                            State = States.HTMLTag
                        Case CharDash
                            Out(j) = b
                            j = j + 1
                            State = States.Norm
                        Case CharA To CharZ, Char_a To Char_z
                            Out(j) = b
                            j = j + 1
                            State = States.Norm
                        Case Else
                            State = States.Norm 'skip output'
                    End Select

                Case States.Entity
                    Select Case b
                        Case CharSemiC
                            'End of entity, wrap it up:'
                            If strAccum = "nbsp" Then
                                Out(j) = CharSpace
                                j = j + 1
                                strAccum = ""
                                State = States.Space1
                            Else
                                'output "X"'
                                Out(j) = CharX
                                j = j + 1
                                State = States.Norm
                            End If
                        Case Else
                            'else, keep scanning for semicolon...'
                            ' accumulate entity chars:'
                            strAccum = strAccum & b
                    End Select

                Case States.HTMLTag
                    If b = CharGT Then
                        If strAccum = "SCRIPT" Then
                            strAccum = ""
                            State = States.Script
                            Substate = SubStates.None
                        ElseIf strAccum = "STYLE" Then
                            strAccum = ""
                            State = States.Style
                            Substate = SubStates.None
                        Else
                            Out(j) = CharSpace
                            j = j + 1
                            State = States.Space1
                            strAccum = ""
                        End If
                    Else
                        'accumulate tag name'
                        strAccum = strAccum & b
                    End If

                Case States.Script
                    Select Case Substate
                        Case SubStates.None
                            If b = CharGT Then
                                Substate = SubStates.EndBegin
                            End If
                        Case SubStates.EndBegin
                            If b = CharSlash Then
                                Substate = SubStates.EndSlash
                                strAccum = ""
                            Else
                                Substate = SubStates.None
                            End If
                        Case SubStates.EndSlash
                            If b = CharGT Then
                                If strAccum = "SCRIPT" Then
                                    'end of script found; output nothing'
                                    State = States.Norm
                                    Substate = SubStates.None
                                Else
                                    'false alarm, back to script-scanning'
                                    Substate = SubStates.None
                                End If
                            Else
                                'accumulate the end-tags label'
                                strAccum = strAccum & b
                            End If
                    End Select

                Case States.Style
                    Select Case Substate
                        Case SubStates.None
                            If b = CharGT Then
                                Substate = SubStates.EndBegin
                            End If
                        Case SubStates.EndBegin
                            If b = CharSlash Then
                                Substate = SubStates.EndSlash
                                strAccum = ""
                            Else
                                Substate = SubStates.None
                            End If
                        Case SubStates.EndSlash
                            If b = CharGT Then
                                If strAccum = "STYLE" Then
                                    'end of script found; output nothing'
                                    State = States.Norm
                                    Substate = SubStates.None
                                Else
                                    'false alarm, back to script-scanning'
                                    Substate = SubStates.None
                                End If
                            Else
                                'accumulate the end-tags label'
                                strAccum = strAccum & b
                            End If
                    End Select

                Case Else
                    Debug.Assert(1 = 0)
            End Select

            'extra check for multiple spaces'
            If j > 1 _
                AndAlso (Out(j - 1) = CharSpace _
                And Out(j - 2) = CharSpace) Then
                j = j - 1   'roll back the last character'
            ElseIf j = 1 AndAlso Out(0) = CharSpace Then
                j = 0   'overwrite leading space'
            End If

        Next

        'remove any trailing space:'
        If j > 0 AndAlso Out(j - 1) = CharSpace Then j = j - 1
        'trim off the trailing excess'
        ReDim Preserve Out(0 To j - 1)

        Return New SqlBytes(Out)
    End Function

    <Microsoft.SqlServer.Server.SqlFunction( _
            DataAccess:=DataAccessKind.None _
            , IsDeterministic:=True _
            , IsPrecise:=True)> _
    Public Shared Function HTMLCopy2(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
        Dim out() As Byte
        ReDim out(0 To chars.Length - 1)

        For i As Integer = 0 To chars.Length - 1
            out(i) = chars.Buffer(i)
        Next

        Return New SqlBytes(out)
    End Function

    <Microsoft.SqlServer.Server.SqlFunction( _
            DataAccess:=DataAccessKind.None _
            , IsDeterministic:=True _
            , IsPrecise:=True)> _
    Public Shared Function HTMLCopy(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
        ' Add your code here'
        Return New SqlTypes.SqlBytes(chars.Buffer)
    End Function

End Class

Upvotes: 1

Remus Rusanu
Remus Rusanu

Reputation: 294287

Such string manipulations are best handled by CLR scalar valued functions.

Upvotes: 1

codeulike
codeulike

Reputation: 23064

I don't think there's a better way in TSQL.

If you've got another environment on top of the SQL layer (e.g asp.net) you might have more luck doing the filtering in that.

Upvotes: 3

Related Questions