Paula
Paula

Reputation: 15

Remove words that contain each other and leave the longer one

I'm looking for a macro (preferably a function) that would take cell contents, split it into separate words, compare them to one another and remove the shorter words.

Here's an image of what I want the output to look like (I need the words that are crossed out removed):

example

I tried to write a macro myself, but it doesn't work 100% properly because it's not taking the last words and sometimes removes what shouldn't be removed. Also, I have to do this on around 50k cells, so a macro takes a lot of time to run, that's why I'd prefer it to be a function. I guess I shouldn't use the replace function, but I couldn't make anything else work.

Sub clean_words_containing_eachother()
    Dim sht1 As Worksheet
    Dim LastRow As Long
    Dim Cell As Range
    Dim cell_value As String
    Dim word, word2 As Variant

    Set sht1 = ActiveSheet
    col = InputBox("Which column do you want to clear?")

    LastRow = sht1.Cells(sht1.Rows.Count, col).End(xlUp).Row

    Let to_clean = col & "2:" & col & LastRow

    For i = 2 To LastRow
        For Each Cell In sht1.Range(to_clean)
            cell_value = Cell.Value
            cell_split = Split(cell_value, " ")
            For Each word In cell_split
                For Each word2 In cell_split
                    If word <> word2 Then
                        If InStr(word2, word) > 0 Then
                            If Len(word) < Len(word2) Then
                                word = word & " "
                                Cell = Replace(Cell, word, " ")
                            ElseIf Len(word) > Len(word2) Then
                                word2 = word2 & " "
                                Cell = Replace(Cell, word2, " ")
                            End If
                        End If
                    End If
                Next word2
            Next word
        Next Cell
    Next i
End Sub

Upvotes: 0

Views: 43

Answers (2)

Ron Rosenfeld
Ron Rosenfeld

Reputation: 60224

Assuming that the retention of the third word in your first example is an error, since books is contained later on in notebooks:

5003886 book books bound case casebound not notebook notebooks office oxford sign signature

and also assuming that you would want to remove duplicate identical words, even if they are not contained subsequently in another word, then we can use a Regular Expression.

The regex will:

  • Capture each word
  • look-ahead to see if that word exists later on in the string
    • if it does, remove it

Since VBA regexes cannot also look-behind, we work-around this limitation by running the regex a second time on the reversed string.

Then remove the extra spaces and we are done.

Option Explicit
Function cleanWords(S As String) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As String

Set RE = CreateObject("vbscript.regexp")
With RE
    .Global = True
    .Pattern = "\b(\w+)\b(?=.*\1)"
    .ignorecase = True

    'replace looking forward
    sTemp = .Replace(S, "")

    ' check in reverse
    sTemp = .Replace(StrReverse(sTemp), "")

    'return to normal
    sTemp = StrReverse(sTemp)

    'Remove extraneous spaces
    cleanWords = WorksheetFunction.Trim(sTemp)
End With

End Function

Limitations

  • punctuation will not be removed
  • a "word" is defined as containing only the characters in the class [_A-Za-z0-9] (letters, digits and the underscore).
  • if any words might be hyphenated, or contain other non-word characters
    • in the above, they will be treated as two separate words
    • if you want it treated as a single word, then we might need to change the regex

Upvotes: 1

user7857211
user7857211

Reputation:

General steps:

  • Write cell to array (already working)
  • for each element (x), go through each element (y) (already working)
  • if x is in y AND y is longer that x THEN set x to ""
  • concat array back into string
  • write string to cell

String/array manipulations are much faster than operations on cells, so this will give you some increase in performance (depending on the amount of words you need to replace for each cell).

The "last word problem" might be that you dont have a space after the last word within your cells, since you only replace word + " " with " ".

Upvotes: 0

Related Questions