Mr Deans
Mr Deans

Reputation: 369

Excel - String Remove Duplicates

I am working with some UK address data which within an Excel cell is split into its constituent parts by a comma.

I have some VBA which I've taken from the web which has removed a number of exact duplicated entries but I am left with a large amount of data which has repeating segments some sequentially and some non sequentially.

Attached is an image highlighting what I am trying to achieve, the code I have used thus far which is not mine is included to show you the direction in which I have been looking. Anyone have any further thoughts on how this can be achieved?

Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal

stringOfUniques = Join(dict.Keys(), ",")
End Function

This did manage to get rid of a number of them but there is a huge population that I am working on so automating this would be incredible.

Ideal Outcome

Upvotes: 3

Views: 1602

Answers (3)

Florent B.
Florent B.

Reputation: 42528

A first solution would be to use a dictionary to get a list of unique segments. It would then be as simple as skipping the first address number before splitting the segments:

Function RemoveDuplicates1(text As String) As String
  Static dict As Object
  If dict Is Nothing Then
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = 1  ' set the case sensitivity to All
  Else
    dict.RemoveAll
  End If

  ' Get the position just after the address number
  Dim c&, istart&, segment
  For istart = 1 To Len(text)
    c = Asc(Mid$(text, istart, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Split the segments and add each one of them to the dictionary. No need to keep 
  ' a reference to each segment since the keys are returned by order of insertion.
  For Each segment In Split(Mid$(text, istart), ",")
    If Len(segment) Then dict(segment) = Empty
  Next

  ' Return the address number and the segments by joining the keys
  RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function

A second solution would be to extract all the segments and then search if each one of them is present at a previous position:

Function RemoveDuplicates2(text As String) As String
  Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&

  ' Get the position just after the address number
  For ifirst = 1 To Len(text)
    c = Asc(Mid$(text, ifirst, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Get the segments without the address number and add a leading/trailing comma
  segments = "," & Mid$(text, ifirst) & ","
  istart = 1

  ' iterate each segment
  Do While istart < Len(segments)

    ' Get the next segment position
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
    If iend - istart Then

      ' Get the segment
      segment = Mid$(segments, istart, iend - istart + 2)

      ' Rewrite the segment if not present at a previous position
      If InStr(1, segments, segment, vbTextCompare) = istart Then
        Mid$(segments, length + 1) = segment
        length = length + Len(segment) - 1
      End If
    End If

    istart = iend + 1
  Loop

  ' Return the address number and the segments
  RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)

End Function

And a third solution would be to use a regular expression to remove all the duplicated segments:

Function RemoveDuplicates3(ByVal text As String) As String

  Static re As Object
  If re Is Nothing Then
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any duplicated segment separated by a comma.
    ' The first segment is compared without the first digits.
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
  End If

  ' Remove each matching segment
  Do While re.test(text)
    text = re.Replace(text, "$1")
  Loop

  RemoveDuplicates3 = text
End Function

These are the execution times for 10000 iterations (the lower the better):

input text  : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"

RemoveDuplicates1 (dictionary)  : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex)       : 1469 ms

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626927

You may really use a regex replacement:

^(\d*\s*([^,]*),.*)\2(,|$)

The replacement pattern is

$1$3

See the regex demo. The pattern explanation:

  • ^ - start of a string (or of a line if .MultiLine = True)
  • (\d*\s*([^,]*),.*) - Group 1 (later referenced to with $1 backreference from the replacement pattern) matching:
    • \d* - 0+ digits followed with
    • \s* - 0+ whitespace characters
    • ([^,]*) - Group 2 (later we can use \2 in-pattern backreference to refer to the value captured with this subpattern) matching 0+ characters other than a comma
    • ,.* - a comma followed with 0+ characters other than a newline
  • \2 - the text captured by Group 2
  • (,|$) - Group 3 (later referenced to with $3 from the replacement pattern - to restore the comma) matching either a comma or the end of string (or line if .MultiLine = True).

NOTE: You do not need .MultiLine = True if you just check individual cells with containing one address.

Below is a sample VBA Sub showing how this can be used in VBA:

Sub test()
  Dim regEx As Object
  Set regEx = CreateObject("VBScript.RegExp")
  With regEx
      .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
      .Global = True
      .MultiLine = True ' Remove if individual addresses are matched
  End With
  s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
      "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
      "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
      "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
  MsgBox regEx.Replace(s, "$1$3")
End Sub

enter image description here

Upvotes: 3

MiguelH
MiguelH

Reputation: 1425

Possibly not the most elegant answer, but this does the trick. Here I use the Split command to split the string at each comma. The result returned from this is

bat ball banana

Code:

Option Explicit
Private Sub test()
 Dim Mystring As String
 Dim StrResult As String

 Mystring = "bat,ball,bat,ball,banana"
 StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
 Dim MySplitz() As String
 Dim J As Integer
 Dim K As Integer
 Dim BooMatch As Boolean
 Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
 Dim StrResult As String


 MySplitz = Split(Mystring, ",")
  For J = 0 To UBound(MySplitz)
     BooMatch = False
     For K = 0 To UBound(StrTemp)
         If MySplitz(J) = StrTemp(K) Then
            BooMatch = True
            Exit For
         End If
     Next K
    If Not BooMatch Then
       StrTemp(J) = MySplitz(J)
    End If
Next
For J = 0 To UBound(StrTemp)
   If Len(StrTemp(J)) > 0 Then ' ignore blank entries
      StrResult = StrResult + StrTemp(J) + " "
   End If
Next J
Debug.Print StrResult
End Function

Upvotes: 4

Related Questions