Reputation: 151

Excel VBA Regex Code to Trim Leading/Trailing White Space and Blank Lines for Multiple Line Text, and Duplicate Words/Spaces/Commas

Can someone think of better Regex Pattern text than what I have created?

I have tested all related Regex code I can find on this subject, none of it works with multiple lines that have many blank spaces and many blank lines of text leading and trailing, and also before and after the text on each line, and multiple lines of blank lines (with spaces on the lines), true garbage text that I need to normalize. I also need to delete duplicate words and spaces and commas.

Within Excel I call the RegEx() function seeking to trim the leading/trailing blank characters (all white space), and leading/trailing blank lines, and to leave only 1 blank line in the middle if multiple blank lines of text exist in the cell.

A1 value is multiline text that has the garbage text described above.

My goal is to trim all lines of blank white space including nonbreaking characters, and delete all leading/trailing blank lines, and to leave only 1 blank line. So it will delete 2+ blank lines and make them only 1.

I want the code to be efficient, and if possible to combine some steps so there are fewer steps.

This code does trim the leading/trailing blank characters and blank lines and leaves one line; it leaves multiple blanks within each row, but it's pretty good.

A1 = see below
It's difficult to input because the multiple lines, but I will type some text that shows, imagine this below as in cell A1:
           
            
           
          
         Several blank lines with spaces before this text and now this. Mark     is a  really      nice guy.       And     
             there is a lot of             
               
       
            
                
       white     space all     over the.     place, and and    ,  , ,,,,,  lots of commas,  ,   , 

      
                   
and , text   text  Text before and  after , commas
          
                         
                     
            many blank characters on every single blank line

Expected results found in F1 should be this:

Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of      
 
white space all over the. place, and, lots of commas,
 
and, text before and after, commas
 
many blank characters on every single blank line

B1 = RegExp(A1, "(\r?\n\s*){2,}", CHAR(10) & CHAR(10), TRUE, TRUE, TRUE)

C1 = RegExp(B1, "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", TRUE, FALSE, TRUE)

Note: C1 at that point now has trimmed text and only 1 line, so it's a good start.

D1 = RegExp(C1, "[ \xA0]+", " ", TRUE, TRUE, TRUE)

Note: D1 now has duplicate spaces removed.

E1 = RegExp(D1, "[ ,]{2,}", ", ", TRUE, TRUE, TRUE)

Note: E1 now has deleted duplicate spaces and commas, I have a lot of garbage in the text

F1 = RegExp(E1, "\b(\w+)\b[\s\xA0]+(?=\1\b)", "", TRUE, FALSE, TRUE)

Note: F1 now has duplicate words deleted so only one word not, 2+

This is working code, if someone can think of more efficient, cleaner code please let me know.

Public Function RegExp(ByVal text As String, _
    ByVal pattern As String, _
    ByVal strReplace As String, _
    Optional ByVal caseInsensitive As Boolean = True, _
    Optional ByVal multiLines As Boolean = True, _
    Optional ByVal allOccurrences As Boolean = True) As String
    Dim regEx As New RegExp
    With regEx
        .pattern = pattern
        .MultiLine = multiLines
        .IgnoreCase = caseInsensitive
        .Global = allOccurrences
    End With
    RegExp = regEx.Replace(text, strReplace) 'returns modified text (or original text if no modifications)
End Function

Just for documentation of failed attempts that did pretty good:

This code here starts to work, but leaves more than one blank line in the middle

B1 = RegExp(A1, "^[\s\xA0]+|[\s\xA0]+$", "", TRUE, FALSE, TRUE)
C1 = RegExp(B1, "^[\s\xA0]+|[\s\xA0]+$", CHAR(10), TRUE, TRUE, TRUE)

Upvotes: 1

Answers (2)

Cary Swoveland

Reputation: 110685

You can replace matches of the following regular expression with empty strings. The multiline flag needs to be set so that ^ and $ respectively match the beginning and end of a line (rather than the beginning and end of the string). It should work with all regex engines that support capture groups and positive lookaheads.

\A(?: *\r?\n)+|^ *\r?\n(?= *\r?\n)|^ +(?=\S)| +(?=[ ]|$)| +(?=,)|,(?= *,)|(?: *\r?\n)+\Z|\b(\w+)\b +(?=\b\1\b)

Demo

The regular expression can be broken down as follows. Note that space characters have been replaced by a character class containing a space ([ ]) to make the spaces visible.

Match blank lines at the beginning of the string

\A             # match start of string
(?:            # begin a non-capture group
  [ ]*\r?\n    # match >= 0 spaces followed by line terminator 
)+             # end non-capture group and execute it >= 1 times

|              # or

Match lines containing zero or more spaces and nothing else that are followed by lines containing zero or more spaces and nothing else

^              # match beginning of line
[ ]*\r?\n      # match zero or more spaces followed by line terminators
(?=[ ]*\r?\n)  # positive lookahead asserts match is followed by zero or
               # more spaces followed by line terminators 

|              # or

Match spaces at the beginning of a line

^              # match beginning of line
[ ]+           # match one or more spaces
(?=\S)         # positive lookahead asserts match is followed by 
               # a non-whitespace character

|              # or

Match one or more spaces followed by a whitespace or the end of the line

[ ]+           # match one or more spaces
(?=[ ]|$)      # positive lookahead asserts match is followed by
               # a space or is at the end of the line

|              # or

Match spaces followed by a comma

[ ]+           # match one of more spaces
(?=,)          # positive lookahead asserts match is followed by
               # a comma

|              # or

Match a comma followed by zero or more spaces followed by a comma

,              # match a comma
(?=[ ]*,)      # positive lookahead asserts match is followed by
               # zero or more spaces followed by a comma

|              # or

Match blank lines at the end of the string

(?:            # begin a non-capture group
  [ ]*\r?\n    # match >= 0 spaces followed by line terminator 
)+             # end non-capture group and execute it >= 1 times
\Z             # match end of string

|              # or

Match a word followed by the same word, separated by one or more spaces

\b(\w+)\b      # match a word with word boundaries for and aft,
               # and save to capture group 1 
[ ]+           # match one or more spaces 
(?=\b\1\b)     # positive lookahead asserts that match is                    
               # followed by the word in capture group 1, with 
               # word boundaries fore and aft

Upvotes: 3

Tim Williams

Reputation: 166306

This suggestion is not about patterns, but addressing efficiency and speed of execution. If you can use multiple smaller replacements but still get good (or much better!) performance then in my mind that simplifies the overall process.

Here's some sample code showing how to re-use the RegExp objects between calls, instead of creating fresh instances for each call.

If you comment out this line in the RegExp function:

dict.Add k, re 'cache the RegExp against the prepared key

...you can see the difference. Note: you need to press STOP in the VB editor to clear the Static dict object to see the difference.

Without caching I see execution times for Tester of >10 sec: with caching this drops to approx. 0.1 sec (so about 100x faster)


'Run some timings...
Sub Tester()
    Dim n As Long, txt As String, res As String, t
    
    txt = [a1].Value 'your example input text
    t = Timer
    For n = 1 To 1000
        res = FixText(txt)
    Next n
    
    Debug.Print Timer - t
    Debug.Print res
End Sub

'perform a bunch of replacements on `text` and return the result
Function FixText(ByVal text As String)
    Dim arr, n
    
    'array of alternating pattern and replacement text pairs
    arr = Array("(\r?\n\s*){2,}", vbLf, _
                "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", _
                "[ \xA0]+", " ", _
                "[ ,]{2,}", ", ")
    
    For n = LBound(arr) To UBound(arr) Step 2
        text = RegExp(text, arr(n), arr(n + 1), True, True, True)
    Next n
    
    FixText = text
End Function

'Run a RegExp replacement on string `text`
'Any prepared RegExp objects are cached for potential re-use
Public Function RegExp(ByVal text As String, ByVal pattern As String, ByVal strReplace As String, _
                          Optional ByVal caseInsensitive As Boolean = True, _
                          Optional ByVal multiLines As Boolean = True, _
                          Optional ByVal allOccurrences As Boolean = True) As String

    Static dict As Object  'for caching RegExp objects between calls
    Dim re As Object, k As String
    
    If dict Is Nothing Then Set dict = CreateObject("scripting.dictionary")
    
    'key representing the settings for the RegExp to be used
    k = Join(Array(pattern, IIf(caseInsensitive, "T", "F"), _
                            IIf(multiLines, "T", "F"), _
                            IIf(allOccurrences, "T", "F")), Chr(0))

    If Not dict.Exists(k) Then 'called before with these settings?
        Set re = New RegExp
        With re
            .pattern = pattern
            .MultiLine = multiLines
            .IgnoreCase = caseInsensitive
            .Global = allOccurrences
        End With
        dict.Add k, re  'cache the RegExp against the prepared key
    Else
        Set re = dict(k)
    End If
    RegExp = re.Replace(text, strReplace)
End Function

Upvotes: 0

Excel VBA Regex Code to Trim Leading/Trailing White Space and Blank Lines for Multiple Line Text, and Duplicate Words/Spaces/Commas

Answers (2)

Related Questions