Andrii.Gaidai
Andrii.Gaidai

Reputation: 333

Remove Duplicate Words from input string. Use VBA Regular Expressions

Language: VBA Environment: Excel 2007 Tool: RegEx object

Hi! I'm trying to remove duplicate words from input string that represent an address. I receive an excel worksheet column with somehow combined address pieces. It is not in Cyrillic, but if to represent in English it'll look something like this:

125424, RepeatedName, RepeatedName, and some words, 75
194044, Repeated-dashedName, Repeated-dashedName, other Uniques, 3
300911, Normal non-repeated, names, dashed and non-Dashed, 123

The text is case-insensitive and can contain numbers, punctuation marks, spaces. It is known that duplicate words can only appear one exactly after another, there will not be other "exclusive" word among duplicate instances except comas and spaces. I need to remove duplicate dashed and non-dashed words and keep only single instance if each repeated word. I need to keep the only instance of "RepeatedName" same as "Repeated-dashedName". So the ideal result will look like this:

125424, RepeatedName, and some words, 75
194044, Repeated-dashedName, other Uniques, 3
300911, Normal non-repeated, names, dashed and non-Dashed, 123

To solve this I've tried different variants of code, but working one is escaping from me. My best guess is this:

Option Explicit
Dim strIn As String, strPattern As String, strReplace As String, strResult as String
dim regex As Object

strIn = fnGetInputString()
strPattern = ".*\b((\w+)\b.*\1).*"
strReplace = "$1"

If regex Is Nothing Then Set regex = New RegExp

With regex
  .MultiLine = False
  .Global = True
  .IgnoreCase = True
  .Pattern = strPattern
End With

strResult = regex.Replace(strIn, strReplace)

but as a result, my strResult consequently gave me only this:

75
3
123

So I haven't managed to correctly capture and reuse repeated group in the regex. Any help will be appreciated.

I'm new to regex but have read some documentation, articles, discussions and StackOverflow questions, but have found no working answer.

Upvotes: 2

Views: 1351

Answers (1)

dawg
dawg

Reputation: 103884

This regex works on the example:

\b([a-zA-Z-]+)[^a-zA-Z-]+\1\b

Demo

Essentially, works like so:

\b([a-zA-Z-]+)[^a-zA-Z-]+\1\b
 ^                          ^      assert a word boundary
   ^   ^   ^ ^                     capture a 'word' series of characters
                ^                  separated by non 'word' characters
                         ^         where the captured word is then repeated

The character set [a-zA-Z-] is limited to the ASCII sense of a 'letter' plus a dash. (Make sure the - is at the END of the character class or else you are defining a range.)

For non latin or non-ASCII character sets, you can either use \p{L} in more modern regex engines or reverse the sense of a 'word' by excluding what a word is not:

\b([^ ,]+)[ ,]+\1\b

     ^                    a word is not a space or a comma...
            ^             a word delimiter is a space or comma...

That works even in basic regex engines such as sed

Demo

Upvotes: 3

Related Questions