Reputation: 151
Can someone think of better Regex Pattern text than what I have created?
I have tested all related Regex code I can find on this subject, none of it works with multiple lines that have many blank spaces and many blank lines of text leading and trailing, and also before and after the text on each line, and multiple lines of blank lines (with spaces on the lines), true garbage text that I need to normalize. I also need to delete duplicate words and spaces and commas.
Within Excel I call the RegEx() function seeking to trim the leading/trailing blank characters (all white space), and leading/trailing blank lines, and to leave only 1 blank line in the middle if multiple blank lines of text exist in the cell.
A1 value is multiline text that has the garbage text described above.
My goal is to trim all lines of blank white space including nonbreaking characters, and delete all leading/trailing blank lines, and to leave only 1 blank line. So it will delete 2+ blank lines and make them only 1.
I want the code to be efficient, and if possible to combine some steps so there are fewer steps.
This code does trim the leading/trailing blank characters and blank lines and leaves one line; it leaves multiple blanks within each row, but it's pretty good.
A1 = see below
It's difficult to input because the multiple lines, but I will type some text that shows, imagine this below as in cell A1:
Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of
white space all over the. place, and and , , ,,,,, lots of commas, , ,
and , text text Text before and after , commas
many blank characters on every single blank line
Expected results found in F1 should be this:
Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of
white space all over the. place, and, lots of commas,
and, text before and after, commas
many blank characters on every single blank line
B1 = RegExp(A1, "(\r?\n\s*){2,}", CHAR(10) & CHAR(10), TRUE, TRUE, TRUE)
C1 = RegExp(B1, "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", TRUE, FALSE, TRUE)
Note: C1 at that point now has trimmed text and only 1 line, so it's a good start.
D1 = RegExp(C1, "[ \xA0]+", " ", TRUE, TRUE, TRUE)
Note: D1 now has duplicate spaces removed.
E1 = RegExp(D1, "[ ,]{2,}", ", ", TRUE, TRUE, TRUE)
Note: E1 now has deleted duplicate spaces and commas, I have a lot of garbage in the text
F1 = RegExp(E1, "\b(\w+)\b[\s\xA0]+(?=\1\b)", "", TRUE, FALSE, TRUE)
Note: F1 now has duplicate words deleted so only one word not, 2+
This is working code, if someone can think of more efficient, cleaner code please let me know.
Public Function RegExp(ByVal text As String, _
ByVal pattern As String, _
ByVal strReplace As String, _
Optional ByVal caseInsensitive As Boolean = True, _
Optional ByVal multiLines As Boolean = True, _
Optional ByVal allOccurrences As Boolean = True) As String
Dim regEx As New RegExp
With regEx
.pattern = pattern
.MultiLine = multiLines
.IgnoreCase = caseInsensitive
.Global = allOccurrences
End With
RegExp = regEx.Replace(text, strReplace) 'returns modified text (or original text if no modifications)
End Function
Just for documentation of failed attempts that did pretty good:
This code here starts to work, but leaves more than one blank line in the middle
B1 = RegExp(A1, "^[\s\xA0]+|[\s\xA0]+$", "", TRUE, FALSE, TRUE)
C1 = RegExp(B1, "^[\s\xA0]+|[\s\xA0]+$", CHAR(10), TRUE, TRUE, TRUE)
Upvotes: 1
Views: 162
Reputation: 110685
You can replace matches of the following regular expression with empty strings. The multiline flag needs to be set so that ^
and $
respectively match the beginning and end of a line (rather than the beginning and end of the string). It should work with all regex engines that support capture groups and positive lookaheads.
\A(?: *\r?\n)+|^ *\r?\n(?= *\r?\n)|^ +(?=\S)| +(?=[ ]|$)| +(?=,)|,(?= *,)|(?: *\r?\n)+\Z|\b(\w+)\b +(?=\b\1\b)
The regular expression can be broken down as follows. Note that space characters have been replaced by a character class containing a space ([ ]
) to make the spaces visible.
Match blank lines at the beginning of the string
\A # match start of string
(?: # begin a non-capture group
[ ]*\r?\n # match >= 0 spaces followed by line terminator
)+ # end non-capture group and execute it >= 1 times
| # or
Match lines containing zero or more spaces and nothing else that are followed by lines containing zero or more spaces and nothing else
^ # match beginning of line
[ ]*\r?\n # match zero or more spaces followed by line terminators
(?=[ ]*\r?\n) # positive lookahead asserts match is followed by zero or
# more spaces followed by line terminators
| # or
Match spaces at the beginning of a line
^ # match beginning of line
[ ]+ # match one or more spaces
(?=\S) # positive lookahead asserts match is followed by
# a non-whitespace character
| # or
Match one or more spaces followed by a whitespace or the end of the line
[ ]+ # match one or more spaces
(?=[ ]|$) # positive lookahead asserts match is followed by
# a space or is at the end of the line
| # or
Match spaces followed by a comma
[ ]+ # match one of more spaces
(?=,) # positive lookahead asserts match is followed by
# a comma
| # or
Match a comma followed by zero or more spaces followed by a comma
, # match a comma
(?=[ ]*,) # positive lookahead asserts match is followed by
# zero or more spaces followed by a comma
| # or
Match blank lines at the end of the string
(?: # begin a non-capture group
[ ]*\r?\n # match >= 0 spaces followed by line terminator
)+ # end non-capture group and execute it >= 1 times
\Z # match end of string
| # or
Match a word followed by the same word, separated by one or more spaces
\b(\w+)\b # match a word with word boundaries for and aft,
# and save to capture group 1
[ ]+ # match one or more spaces
(?=\b\1\b) # positive lookahead asserts that match is
# followed by the word in capture group 1, with
# word boundaries fore and aft
Upvotes: 3
Reputation: 166306
This suggestion is not about patterns, but addressing efficiency and speed of execution. If you can use multiple smaller replacements but still get good (or much better!) performance then in my mind that simplifies the overall process.
Here's some sample code showing how to re-use the RegExp objects between calls, instead of creating fresh instances for each call.
If you comment out this line in the RegExp
function:
dict.Add k, re 'cache the RegExp against the prepared key
...you can see the difference. Note: you need to press STOP in the VB editor to clear the Static dict
object to see the difference.
Without caching I see execution times for Tester
of >10 sec: with caching this drops to approx. 0.1 sec (so about 100x faster)
'Run some timings...
Sub Tester()
Dim n As Long, txt As String, res As String, t
txt = [a1].Value 'your example input text
t = Timer
For n = 1 To 1000
res = FixText(txt)
Next n
Debug.Print Timer - t
Debug.Print res
End Sub
'perform a bunch of replacements on `text` and return the result
Function FixText(ByVal text As String)
Dim arr, n
'array of alternating pattern and replacement text pairs
arr = Array("(\r?\n\s*){2,}", vbLf, _
"^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", _
"[ \xA0]+", " ", _
"[ ,]{2,}", ", ")
For n = LBound(arr) To UBound(arr) Step 2
text = RegExp(text, arr(n), arr(n + 1), True, True, True)
Next n
FixText = text
End Function
'Run a RegExp replacement on string `text`
'Any prepared RegExp objects are cached for potential re-use
Public Function RegExp(ByVal text As String, ByVal pattern As String, ByVal strReplace As String, _
Optional ByVal caseInsensitive As Boolean = True, _
Optional ByVal multiLines As Boolean = True, _
Optional ByVal allOccurrences As Boolean = True) As String
Static dict As Object 'for caching RegExp objects between calls
Dim re As Object, k As String
If dict Is Nothing Then Set dict = CreateObject("scripting.dictionary")
'key representing the settings for the RegExp to be used
k = Join(Array(pattern, IIf(caseInsensitive, "T", "F"), _
IIf(multiLines, "T", "F"), _
IIf(allOccurrences, "T", "F")), Chr(0))
If Not dict.Exists(k) Then 'called before with these settings?
Set re = New RegExp
With re
.pattern = pattern
.MultiLine = multiLines
.IgnoreCase = caseInsensitive
.Global = allOccurrences
End With
dict.Add k, re 'cache the RegExp against the prepared key
Else
Set re = dict(k)
End If
RegExp = re.Replace(text, strReplace)
End Function
Upvotes: 0