Reputation: 369
I am working with some UK address data which within an Excel cell is split into its constituent parts by a comma.
I have some VBA which I've taken from the web which has removed a number of exact duplicated entries but I am left with a large amount of data which has repeating segments some sequentially and some non sequentially.
Attached is an image highlighting what I am trying to achieve, the code I have used thus far which is not mine is included to show you the direction in which I have been looking. Anyone have any further thoughts on how this can be achieved?
Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")
For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal
stringOfUniques = Join(dict.Keys(), ",")
End Function
This did manage to get rid of a number of them but there is a huge population that I am working on so automating this would be incredible.
Upvotes: 3
Views: 1602
Reputation: 42528
A first solution would be to use a dictionary to get a list of unique segments. It would then be as simple as skipping the first address number before splitting the segments:
Function RemoveDuplicates1(text As String) As String
Static dict As Object
If dict Is Nothing Then
Set dict = CreateObject("Scripting.Dictionary")
dict.CompareMode = 1 ' set the case sensitivity to All
Else
dict.RemoveAll
End If
' Get the position just after the address number
Dim c&, istart&, segment
For istart = 1 To Len(text)
c = Asc(Mid$(text, istart, 1))
If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ]
Next
' Split the segments and add each one of them to the dictionary. No need to keep
' a reference to each segment since the keys are returned by order of insertion.
For Each segment In Split(Mid$(text, istart), ",")
If Len(segment) Then dict(segment) = Empty
Next
' Return the address number and the segments by joining the keys
RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function
A second solution would be to extract all the segments and then search if each one of them is present at a previous position:
Function RemoveDuplicates2(text As String) As String
Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&
' Get the position just after the address number
For ifirst = 1 To Len(text)
c = Asc(Mid$(text, ifirst, 1))
If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ]
Next
' Get the segments without the address number and add a leading/trailing comma
segments = "," & Mid$(text, ifirst) & ","
istart = 1
' iterate each segment
Do While istart < Len(segments)
' Get the next segment position
iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
If iend - istart Then
' Get the segment
segment = Mid$(segments, istart, iend - istart + 2)
' Rewrite the segment if not present at a previous position
If InStr(1, segments, segment, vbTextCompare) = istart Then
Mid$(segments, length + 1) = segment
length = length + Len(segment) - 1
End If
End If
istart = iend + 1
Loop
' Return the address number and the segments
RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)
End Function
And a third solution would be to use a regular expression to remove all the duplicated segments:
Function RemoveDuplicates3(ByVal text As String) As String
Static re As Object
If re Is Nothing Then
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.IgnoreCase = True
' Match any duplicated segment separated by a comma.
' The first segment is compared without the first digits.
re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
End If
' Remove each matching segment
Do While re.test(text)
text = re.Replace(text, "$1")
Loop
RemoveDuplicates3 = text
End Function
These are the execution times for 10000 iterations (the lower the better):
input text : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"
RemoveDuplicates1 (dictionary) : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex) : 1469 ms
Upvotes: 1
Reputation: 626927
You may really use a regex replacement:
^(\d*\s*([^,]*),.*)\2(,|$)
The replacement pattern is
$1$3
See the regex demo. The pattern explanation:
^
- start of a string (or of a line if .MultiLine = True
)(\d*\s*([^,]*),.*)
- Group 1 (later referenced to with $1
backreference from the replacement pattern) matching:
\d*
- 0+ digits followed with\s*
- 0+ whitespace characters([^,]*)
- Group 2 (later we can use \2
in-pattern backreference to refer to the value captured with this subpattern) matching 0+ characters other than a comma,.*
- a comma followed with 0+ characters other than a newline\2
- the text captured by Group 2(,|$)
- Group 3 (later referenced to with $3
from the replacement pattern - to restore the comma) matching either a comma or the end of string (or line if .MultiLine = True
).NOTE: You do not need .MultiLine = True
if you just check individual cells with containing one address.
Below is a sample VBA Sub showing how this can be used in VBA:
Sub test()
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
.Global = True
.MultiLine = True ' Remove if individual addresses are matched
End With
s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
"9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
"35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
"27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
MsgBox regEx.Replace(s, "$1$3")
End Sub
Upvotes: 3
Reputation: 1425
Possibly not the most elegant answer, but this does the trick. Here I use the Split command to split the string at each comma. The result returned from this is
bat ball banana
Code:
Option Explicit
Private Sub test()
Dim Mystring As String
Dim StrResult As String
Mystring = "bat,ball,bat,ball,banana"
StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
Dim MySplitz() As String
Dim J As Integer
Dim K As Integer
Dim BooMatch As Boolean
Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
Dim StrResult As String
MySplitz = Split(Mystring, ",")
For J = 0 To UBound(MySplitz)
BooMatch = False
For K = 0 To UBound(StrTemp)
If MySplitz(J) = StrTemp(K) Then
BooMatch = True
Exit For
End If
Next K
If Not BooMatch Then
StrTemp(J) = MySplitz(J)
End If
Next
For J = 0 To UBound(StrTemp)
If Len(StrTemp(J)) > 0 Then ' ignore blank entries
StrResult = StrResult + StrTemp(J) + " "
End If
Next J
Debug.Print StrResult
End Function
Upvotes: 4