Will Rickards

Reputation: 2811

Highlighting Algorithm - when length of match does not equal length of search string

I have a highlighting algorithm that takes a string and adds highlighting codes around each match in it. The problem arises with input such as "Find tæst" as the text to be searched and "taest" as the string to find. IndexOf reports the match, but the ligature æ counts as one character in the text while "ae" counts as two in the search string, so the matched text is shorter than the search string and I can't locate the end of the match accurately. I don't think IndexOf alone will work for me here. Something that returns both the index and the length of the match would, but I don't know what else to use.

    ' cycle through search words and replace them in the text
    For intWord = LBound(m_arrSearchWords) To UBound(m_arrSearchWords)

       If m_arrSearchWords(intWord).Length > 0 Then

          ' replace instances of the word with the word surrounded by bold codes

          ' find starting position
          intPos = strText.IndexOf(m_arrSearchWords(intWord), System.StringComparison.CurrentCultureIgnoreCase)
          Do While intPos <> -1

             strText = strText.Substring(0, (intPos - 1) - 0 + 1) & cstrHighlightCodeOn & strText.Substring(intPos, m_arrSearchWords(intWord).Length) & cstrHighlightCodeOff & strText.Substring(intPos + m_arrSearchWords(intWord).Length)
             intPos = strText.IndexOf(m_arrSearchWords(intWord), intPos + m_arrSearchWords(intWord).Length + cstrHighlightCodeOn.Length + cstrHighlightCodeOff.Length, System.StringComparison.CurrentCultureIgnoreCase)

          Loop

       End If

    Next intWord

The Substring call fails because the requested length runs past the end of the string. I added a fix for strings that end with the search term (not shown above), but longer strings are still highlighted incorrectly and I need to fix those.

Upvotes: 0

Views: 128

Answers (2)

Floris

Reputation: 46445

If I understand correctly, you are looking for a function that returns the "matched-string" - in other words, when you are looking for s1 inside s2, then you want to know exactly what part of s2 was matched (index of first and last character matched). This allows you to highlight the match, and doesn't modify the string (upper/lower case, ligature, etc).

I don't have VB.NET, and unfortunately VBA doesn't have exactly the same search functionality. The following VBA code correctly identifies the beginning and end of a match, but please understand that it has only been tested with upper/lower case matching. I hope this helps you solve the problem.

    Option Compare Text
    Option Explicit

    Function startEndIndex(bigString, smallString)
        ' function that returns start, end index
        ' of the match
        ' it keeps shortening the bigString until no match is found
        ' this is how it takes care of mismatches in number of characters
        ' because of a match between "similar" strings
        Dim i1, i2
        Dim shorterString

        i2 = 0

        ' first see if there is a match at all:
        i1 = InStr(1, bigString, smallString, vbTextCompare)

        If i1 > 0 Then
            ' largest value that i2 can have is end of string:
            i2 = Len(bigString)

            ' can make it shorter: limit the window to twice the length of the search string
            If i2 > i1 + 2 * Len(smallString) Then i2 = i1 + 2 * Len(smallString)
            shorterString = Mid(bigString, i1, i2 - i1)

            ' keep making the string shorter until there is no match:
            While InStr(1, shorterString, smallString, vbTextCompare) > 0
                i2 = i2 - 1
                shorterString = Mid(bigString, i1, i2 - i1)
            Wend

        End If

        ' return the values as an array
        ' (after the loop, i2 is the index of the last matched character):
        startEndIndex = Array(i1, i2)

    End Function


    Sub test()
        ' a simple test routine to see that things work:
        Dim a
        Dim longString: longString = "This is a very long TaesT of a complicated string"
        a = startEndIndex(longString, "very long taest")
        If a(0) = 0 And a(1) = 0 Then
            MsgBox "no match found"
        Else
            Dim highlightString As String
            highlightString = Left(longString, a(0) - 1) & "*" & Mid(longString, a(0), a(1) - a(0) + 1) & _
                "*" & Mid(longString, a(1) + 1)
            MsgBox "start at " & a(0) & " and end at " & a(1) & vbCrLf & _
                "string matched is '" & Mid(longString, a(0), a(1) - a(0) + 1) & "'" & vbCrLf & _
                "with highlighting: " & highlightString
        End If
    End Sub

Upvotes: -1

Will Rickards

Reputation: 2811

While it would be nice for IndexOf to return the match length, it turns out you can work it out with a comparison of your own. After IndexOf finds the start, I do a secondary comparison against substrings of the text taken at that position. I start at the length of the search word and work my way down; as soon as a substring compares equal, I use that length. If none does, I work my way up in length instead. This works whether the search string is longer or shorter than the text it matches. It costs at least one extra comparison in the normal case, and in the worst case a number of extra comparisons that depends on the length of the search word. Maybe if I had the implementation of IndexOf, I could improve on it. But at least this works.

    ' cycle through search words and replace them in the text
    For intWord = LBound(m_arrSearchWords) To UBound(m_arrSearchWords)

       If m_arrSearchWords(intWord).Length > 0 Then

          ' find starting position
          intPos = strText.IndexOf(m_arrSearchWords(intWord), System.StringComparison.CurrentCultureIgnoreCase)
          Do While intPos <> -1

             intOrigLength = m_arrSearchWords(intWord).Length

             ' if there isn't enough of the text left to add the search word length to
             If strText.Length < ((intPos + intOrigLength - 1) - 0 + 1) Then

                ' use shorter length
                intOrigLength = ((strText.Length - 1) - intPos + 1)

             End If

             ' find largest match
             For intLength = intOrigLength To 1 Step -1

                If m_arrSearchWords(intWord).Equals(strText.Substring(intPos, intLength), StringComparison.CurrentCultureIgnoreCase) Then

                   ' if match found, highlight it
                   strText = strText.Substring(0, (intPos - 1) - 0 + 1) & cstrHighlightCodeOn & strText.Substring(intPos, intLength) & cstrHighlightCodeOff & strText.Substring(intPos + intLength)

                   ' find next
                   intPos = strText.IndexOf(m_arrSearchWords(intWord), intPos + intLength + cstrHighlightCodeOn.Length + cstrHighlightCodeOff.Length, System.StringComparison.CurrentCultureIgnoreCase)

                   ' exit search for largest match
                   Exit For

                End If

             Next

             ' if we didn't find it by searching smaller - search larger
             If intLength = 0 Then

                For intLength = intOrigLength + 1 To ((strText.Length - 1) - intPos + 1)

                   If m_arrSearchWords(intWord).Equals(strText.Substring(intPos, intLength), StringComparison.CurrentCultureIgnoreCase) Then

                      ' if match found, highlight it
                      strText = strText.Substring(0, (intPos - 1) - 0 + 1) & cstrHighlightCodeOn & strText.Substring(intPos, intLength) & cstrHighlightCodeOff & strText.Substring(intPos + intLength)

                      ' find next
                      intPos = strText.IndexOf(m_arrSearchWords(intWord), intPos + intLength + cstrHighlightCodeOn.Length + cstrHighlightCodeOff.Length, System.StringComparison.CurrentCultureIgnoreCase)

                      ' exit search for largest match
                      Exit For

                   End If

                Next

             End If

          Loop

       End If

    Next intWord
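
The same idea can be pulled into a helper that hands back both the index and the length, which keeps the highlighting loop short. This is only a minimal sketch under some assumptions: `FindMatch` and `Highlight` are hypothetical names (not framework APIs), `"<b>"` and `"</b>"` stand in for the real highlight codes, and it handles a single search word. Whether æ actually compares equal to "ae" depends on the culture data of the runtime in use.

```vbnet
Imports System
Imports System.Text

Module HighlightSketch

    ' Hypothetical helper (not a framework API): returns the index of the next
    ' match of word in source at or after startAt, or -1 if there is none.
    ' matchLength receives how many characters of source the match covers.
    Function FindMatch(source As String, word As String, startAt As Integer,
                       ByRef matchLength As Integer) As Integer
        matchLength = 0
        If String.IsNullOrEmpty(word) OrElse startAt >= source.Length Then Return -1

        Dim pos As Integer = source.IndexOf(word, startAt, StringComparison.CurrentCultureIgnoreCase)
        If pos = -1 Then Return -1

        Dim maxLen As Integer = source.Length - pos
        Dim startLen As Integer = Math.Min(word.Length, maxLen)

        ' Try the search word's own length first, then shorter candidates.
        For candidate As Integer = startLen To 1 Step -1
            If word.Equals(source.Substring(pos, candidate), StringComparison.CurrentCultureIgnoreCase) Then
                matchLength = candidate
                Return pos
            End If
        Next

        ' Nothing shorter compared equal, so try longer candidates.
        For candidate As Integer = startLen + 1 To maxLen
            If word.Equals(source.Substring(pos, candidate), StringComparison.CurrentCultureIgnoreCase) Then
                matchLength = candidate
                Return pos
            End If
        Next

        ' Assumption: if no length compares equal, fall back to startLen so
        ' the caller always moves forward instead of looping forever.
        matchLength = startLen
        Return pos
    End Function

    ' Hypothetical wrapper: surrounds every match of word with codeOn/codeOff.
    ' The result is built separately, so the inserted codes are never rescanned.
    Function Highlight(source As String, word As String,
                       codeOn As String, codeOff As String) As String
        Dim result As New StringBuilder()
        Dim cursor As Integer = 0
        Dim matchLen As Integer = 0
        Dim pos As Integer = FindMatch(source, word, cursor, matchLen)

        Do While pos <> -1
            ' copy the text before the match, then the match wrapped in codes
            result.Append(source, cursor, pos - cursor)
            result.Append(codeOn).Append(source, pos, matchLen).Append(codeOff)
            cursor = pos + matchLen
            pos = FindMatch(source, word, cursor, matchLen)
        Loop

        ' copy whatever follows the last match
        result.Append(source, cursor, source.Length - cursor)
        Return result.ToString()
    End Function

    Sub Main()
        ' "<b>" and "</b>" are placeholder highlight codes
        Console.WriteLine(Highlight("Find tæst", "taest", "<b>", "</b>"))
    End Sub

End Module
```

Building the output in a StringBuilder also removes the need to skip past the inserted codes when looking for the next match, and the fallback length keeps the loop from getting stuck if no substring length compares equal.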

Upvotes: 0
