Reputation: 21639
I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.
One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.
I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace
:
Function MyTrim(s As String) As String
Do While InStr(s, " ") > 0
s = Replace$(s, " ", " ")
Loop
MyTrim = Trim$(s)
End Function
I tried this method and it was "worked", but was painfully slow:
Len In: 44930886 Len Out: 35322469 Runtime: 247.6 seconds
Is there a faster way to remove whitespace from a "very large" string?
Upvotes: 7
Views: 1813
Reputation: 4008
This question is a lot more interesting than the answer makes it seem because there should be nothing wrong with the solution proposed by OP, as the algorithm is plenty efficient in theory.
It turns out, that the problem here is the poor implementation of VBA's inbuilt Replace
function under the hood, which causes it to completely choke on large strings with many replacements.
It is easily possible to manually implement a Replace
function with linear runtime that far outperforms the inbuilt function for large strings. An example of such an implementation is presented here:
'Works like the inbuilt 'Replace', but is much faster on large strings
'with many replacements
'This function is the renamed function `ReplaceFast` from here:
'https://github.com/guwidoe/VBA-StringTools
Public Function Replace(ByRef str As String, _
ByRef sFind As String, _
ByRef sReplace As String, _
Optional ByVal lStart As Long = 1, _
Optional ByVal lCount As Long = -1, _
Optional ByVal lCompare As VbCompareMethod _
= vbBinaryCompare) As String
Const methodName As String = "Replace"
If lStart < 1 Then Err.Raise 5, methodName, _
"Argument 'lStart' = " & lStart & " < 1, invalid"
If lCount < -1 Then Err.Raise 5, methodName, _
"Argument 'lCount' = " & lCount & " < -1, invalid"
lCount = lCount And &H7FFFFFFF
If lCompare <> vbBinaryCompare Or Len(str) < 10000 Or lCount < 10000 Then
Replace = VBA.Replace(str, sFind, sReplace, lStart, lCount, lCompare)
Exit Function
End If
If Len(str) = 0 Or Len(sFind) = 0 Then
Replace = Mid$(str, lStart)
Exit Function
End If
Dim lenFind As Long: lenFind = Len(sFind)
Dim lenReplace As Long: lenReplace = Len(sReplace)
If lenFind = 0 Then Exit Function
Static lFindPositions() As Long
If (Not Not lFindPositions) = 0 Then ReDim lFindPositions(0 To 32767)
Dim numFinds As Long
Dim k As Long: k = InStr(lStart, str, sFind, lCompare)
On Error GoTo catch
Do Until k = 0 Or lCount = numFinds
lFindPositions(numFinds) = k
numFinds = numFinds + 1
k = InStr(k + lenFind, str, sFind, lCompare)
Loop
On Error GoTo 0
GoTo continue
catch:
ReDim Preserve lFindPositions(LBound(lFindPositions) To _
UBound(lFindPositions) * 4)
Resume
continue:
Dim bufferSizeChange As Long
bufferSizeChange = numFinds * (lenReplace - lenFind) - lStart + 1
If Len(str) + bufferSizeChange < 0 Then Exit Function
Replace = Space$(Len(str) + bufferSizeChange)
Dim i As Long
Dim j As Long: j = 1
Dim lastOccurrence As Long: lastOccurrence = lStart
Dim count As Long: count = 1
For k = 0 To numFinds - 1
If count > lCount Then Exit For
i = lFindPositions(k)
Dim diff As Long: diff = i - lastOccurrence
If diff > 0 Then _
Mid$(Replace, j, diff) = Mid$(str, lastOccurrence, diff)
j = j + diff
If lenReplace <> 0 Then
Mid$(Replace, j, lenReplace) = sReplace
j = j + lenReplace
End If
count = count + 1
lastOccurrence = i + lenFind
Next k
If j <= Len(Replace) Then Mid$(Replace, j) = Mid$(str, lastOccurrence)
End Function
Just pasting this code into the project should fix the performance issues of the original code without changing anything about it at all, just by overriding the inbuilt Replace
function.
In my testing, the original code should only take about 3 seconds to process a string similar to OPs example when the improved Replace
function is present, about a 100 times improvement!:
Sub DemoMyTrim()
Const LEN_INPUT_STR = 45000000
Dim inputStr As String
inputStr = String(LEN_INPUT_STR * 2 / 3, "a") & Space(LEN_INPUT_STR / 3)
Dim t As Single: t = Timer()
Dim outStr As String: outStr = MyTrim(inputStr)
Debug.Print "Trimming took " & Timer() - t & " seconds."
Debug.Print "Len Out: " & Len(outStr)
End Sub
Public Function MyTrim(ByRef s As String) As String
MyTrim = s
Do While InStr(MyTrim, " ") > 0
MyTrim = Replace(MyTrim, " ", " ")
Loop
End Function
While this is very interesting, it is still slower than the RegEx solution proposed by the accepted answer.
Since the accepted answer uses Regex which is not available on Mac, I want to present another alternative that is even faster than the original algorithm with the improved Replace
function and still makes do with VBA inbuilt functions that are available on any platform.
This is possible with another function from the LibStringTools library:
'Replaces consecutive occurrences of 'substring' that repeat more than 'limit'
'times with exactly 'limit' consecutive occurrences
'Source:
'https://github.com/guwidoe/VBA-StringTools
Public Function LimitConsecutiveSubstringRepetition( _
ByRef str As String, _
Optional ByRef subStr As String = vbNewLine, _
Optional ByVal limit As Long = 1, _
Optional ByVal Compare As VbCompareMethod _
= vbBinaryCompare) _
As String
Const methodName As String = "LimitConsecutiveSubstringRepetition"
Static recursionDepth As Long
If limit < 0 Then Err.Raise 5, methodName, _
"Argument 'limit' = " & limit & " < 0, invalid"
Dim lenSubStr As Long: lenSubStr = Len(subStr)
Dim lenStr As Long: lenStr = Len(str)
LimitConsecutiveSubstringRepetition = str
If lenStr = 0 Or lenSubStr = 0 Or lenStr < lenSubStr * (limit + 1) Then
Exit Function
End If
If lenSubStr = 1 Then
Dim alSubStr As String: alSubStr = String$(limit + 1, subStr)
Else
alSubStr = Space$(lenSubStr * (limit + 1))
Mid$(alSubStr, 1) = subStr
If limit + 1 > 1 Then Mid$(alSubStr, lenSubStr + 1) = alSubStr
End If
Dim lenAlSubStr As Long: lenAlSubStr = Len(alSubStr)
Dim i As Long: i = InStr(1, str, alSubStr, Compare)
Dim j As Long: j = 1
Dim lastOccurrence As Long: lastOccurrence = 1 - lenSubStr
Dim copyChunkSize As Long
If i = 0 Then Exit Function
Do Until i = 0
i = i + lenSubStr * limit
lastOccurrence = lastOccurrence + lenSubStr
copyChunkSize = i - lastOccurrence
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
Do
lastOccurrence = i
i = InStr(lastOccurrence + lenSubStr, str, subStr, Compare)
Loop Until i - lastOccurrence <> lenSubStr
If i = 0 Then Exit Do
If limit > 0 Then i = InStr(i, str, alSubStr, Compare)
Loop
copyChunkSize = lenStr - lastOccurrence - lenSubStr + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, lastOccurrence + lenSubStr)
If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
LimitConsecutiveSubstringRepetition = _
Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
Do Until InStr(1, LimitConsecutiveSubstringRepetition, alSubStr, Compare) = 0
Dim s As String: s = LimitConsecutiveSubstringRepetition
lenStr = Len(s)
If lenSubStr = 2 And limit = 0 _
And StrComp(Left$(subStr, 1), Right$(subStr, 1), Compare) <> 0 Then
i = InStr(1, s, alSubStr, Compare)
j = 1
lastOccurrence = 1
Dim leftChar As String: leftChar = Left$(subStr, 1)
Dim rightChar As String: rightChar = Right$(subStr, 1)
Do Until i = 0
Dim l As Long: l = i
Dim r As Long: r = i + 1
Do
l = l - 1
r = r + 1
If l < 1 Then Exit Do
Loop Until StrComp(Mid$(s, l, 1), leftChar, Compare) <> 0 _
Or StrComp(Mid$(s, r, 1), rightChar, Compare) <> 0
copyChunkSize = l + 1 - lastOccurrence
If copyChunkSize > 0 Then _
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
lastOccurrence = r
i = InStr(r, s, alSubStr, Compare)
Loop
copyChunkSize = lenStr - r + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence)
Else
Dim lSubStr As String: lSubStr = Left$(alSubStr, lenSubStr * limit)
Dim lenlSubStr As Long: lenlSubStr = Len(lSubStr)
Dim minL As Long: minL = 1
Dim maxR As Long
i = InStr(1, s, alSubStr, Compare)
j = 1
lastOccurrence = 1
Do Until i = 0
Dim susChunk As String
susChunk = Space$(lenAlSubStr * 2 - 2 + lenlSubStr)
l = i
r = i + lenAlSubStr
maxR = InStr(r, s, alSubStr, Compare) - 1
If maxR = -1 Then maxR = lenStr
Dim lenLeft As Long, lenRight As Long
Do
If l - lenSubStr + 1 < minL Then
lenLeft = l - minL
Else
lenLeft = lenSubStr - 1
End If
If r + lenSubStr - 2 > maxR Then
lenRight = maxR - r + 1
Else
lenRight = lenSubStr - 1
End If
If lenLeft + lenRight < lenSubStr Then Exit Do
Mid$(susChunk, 1, lenLeft) = Mid$(s, l - lenLeft, lenLeft)
If lenlSubStr > 0 Then _
Mid$(susChunk, lenLeft + 1, lenlSubStr) = lSubStr
Mid$(susChunk, lenLeft + lenlSubStr + 1, lenRight) = _
Mid$(s, r, lenRight)
susChunk = Left$(susChunk, lenLeft + lenRight + lenlSubStr)
Dim n As Long: n = InStr(1, susChunk, alSubStr, Compare)
If n = 0 Then Exit Do
l = l + n - lenLeft - 1
r = r + n + lenSubStr - 1 - lenLeft
Loop
copyChunkSize = l - lastOccurrence
If copyChunkSize > 0 Then _
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
If limit > 0 Then
Mid$(LimitConsecutiveSubstringRepetition, j, lenlSubStr) = _
Left(alSubStr, lenlSubStr)
j = j + lenlSubStr
Mid$(s, r - lenlSubStr, lenlSubStr) = lSubStr
End If
minL = maxR + 1
lastOccurrence = r
i = InStr(r - lenlSubStr, s, alSubStr, Compare)
Loop
copyChunkSize = lenStr - r + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence)
End If
If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
LimitConsecutiveSubstringRepetition = _
Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
Loop
End Function
Using this function, the desired result can be achieved as follows:
Dim inputStr as String
'... somehow populate input string
dim outStr as String
outStr = LimitConsecutiveSubstringRepetition(inputStr, " ", 1)
Depending on the structure of the input string, this can be even faster than the RegEx method in some cases.
Upvotes: 2
Reputation: 21639
In VBA, the size of a String
is limited to approximately 2 Billion Characters. The "Replace
-Loop
" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.
Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.
Excel has a built-in worksheet function Trim
which is not the same as VBA's Trim
function.
Worksheet function Trim
removes all spaces from text except for single spaces between words.
The problem is that Trim
, like all functions called with Application.WorksheetFunction
, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.
However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:
EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.
Function bigTrim(strIn As String) As String Const maxLen = 32766 Dim loops As Long, x As Long loops = Int(Len(strIn) / maxLen) If (Len(strIn) / maxLen) <> loops Then loops = loops + 1 For x = 1 To loops bigTrim = bigTrim & _ Application.WorksheetFunction.Trim(Mid(strIn, _ ((x - 1) * maxLen) + 1, maxLen)) Next x End Function
Running this function on the same string that was used with the "Replace
-Loop
" method yielded much better results:
Len In: 44930886 Len Out: 35321845 Runtime: 33.6 seconds
That's more than 7x faster than the "Replace
-Loop
" method, and managed to remove 624 spaces that were somehow missed by the other method.
(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!) ☺
Upvotes: 1
Reputation: 20812
I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.
A Regex replace has a good chance of that.
Option Explicit
Sub Test(ByVal text As String)
Static Regex As Object
If Regex Is Nothing Then
Set Regex = CreateObject("VBScript.RegExp")
Regex.Global = True
Regex.MultiLine = True
End If
Regex.Pattern = " +" ' space, one or more times
Dim result As String: result = Regex.Replace(text, " ")
Debug.Print Len(result), Left(result, 20)
End Sub
With an input string of 45 million characters takes about a second.
Runner:
Sub Main()
Const ForReading As Integer = 1
Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
Dim text As String: text = file.ReadAll()
Set file = Nothing
Set fso = Nothing
Debug.Print Len(text), Left(text, 20)
Test (text)
End Sub
Test data creator (C#):
var substring = "××\n× ×× ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);
BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32)
. (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)
], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)
Upvotes: 6