user2867494
user2867494

Reputation:

Capitalize string of words except for prepositional words

I am using the code below to take a string entered from a text box, and convert to capital case, except for words like (the, and, an, as, to, or, on) etc.

Issue #1: I want the first word of the string to always be capitalized regardless of what the word is.

Issue #2: The word spacing is not correct when the string is put back together.

xText = queryForHTML    
xTextSplit = split(xText, " ")

for each item in xTextSplit

    xWord = item

    if lcase(item) = "the" or lcase(item) = "and" or lcase(item) = "an" or lcase(item) = "as" or lcase(item) = "to" or lcase(item) = "is" or lcase(item) = "on" then
        xWord = lcase(item)
    end if

    xCompleteWord = xCompleteWord & " " & xWord

next

queryForHTML = xCompleteWord

Upvotes: 0

Views: 1019

Answers (2)

Mark Bertenshaw
Mark Bertenshaw

Reputation: 5689

The following code is based around the GetStringTypeW() Win32 API function, which provides information about characters in a string. You are only worried about characters which can be upper case or lower case. The problem with your code is that it only works with the simplest case where spaces break up words. But words can be broken up by punctuation. And there are many Unicode characters which have no concept of "upper case" and "lower case".

Instead of writing this boring, error prone code to do this, I take advantage of GetStringTypeW(). I iterate through each element in the array, where each element corresponds to a character in the string in the same position. I have a flag bInWord which stores whether the current position is inside a word. If we hit an upper or lower case character, and this hasn't been set, we set it, and save the current position as the start of the word. In addition, if we have hit an upper case character, and we already know we are in a word, then we there and then make the character lower case, by writing into the returned string. When we hit non-alphabetical characters, or reach the end of the string, and bInWord is set, we then compare the last word with the list of "non-proper-cased" words. If we match, and the first character is upper case, then we overwrite the character with a lower case character. If we don't match, and the first character is lower-case, we overwrite the character with an upper case character.

Option Explicit

Private Declare Function GetStringTypeW Lib "Kernel32.dll" ( _
    ByVal dwInfoType As Long, _
    ByVal lpSrcStr As Long, _
    ByVal cchSrc As Long, _
    ByRef lpCharType As Integer _
) As Long

Private Const CT_CTYPE1                     As Long = &H1

Private Const C1_UPPER                      As Long = &H1     ' Uppercase
Private Const C1_LOWER                      As Long = &H2     ' Lowercase
Private Const C1_DIGIT                      As Long = &H4     ' Decimal digits
Private Const C1_SPACE                      As Long = &H8     ' Space characters
Private Const C1_PUNCT                      As Long = &H10    ' Punctuation
Private Const C1_CNTRL                      As Long = &H20    ' Control characters
Private Const C1_BLANK                      As Long = &H40    ' Blank characters
Private Const C1_XDIGIT                     As Long = &H80    ' Hexadecimal digits
Private Const C1_ALPHA                      As Long = &H100   ' Any linguistic character: alphabetical, syllabary, or ideographic
Private Const C1_DEFINED                    As Long = &H200   ' A defined character, but not one of the other C1_* types

Private Function ProperCaseWords(ByRef in_sText As String) As String

    Dim lTextLen            As Long
    Dim aiCharType()        As Integer
    Dim lPos                As Long
    Dim lPosStartWord       As Long
    Dim bInWord             As Boolean
    Dim bFirstCharUCase     As Boolean
    Dim sWord               As String

    ' Output buffer contains a copy of the original string.
    ProperCaseWords = in_sText

    lTextLen = Len(in_sText)

    ' Resize the character type buffer to be one more than the string.
    ReDim aiCharType(1 To lTextLen + 1)

    ' Retrieve string type data about this Unicode string into <aiCharType()>.
    ' If it fails, then we just return the original string.
    ' Note that the last element in the array is not filled by this function, and will contain zero.
    ' This is deliberate, so we can handle the corner case where the last word is right at the end of the string.
    If (GetStringTypeW(CT_CTYPE1, StrPtr(ProperCaseWords), lTextLen, aiCharType(1))) = 0 Then
        Exit Function
    End If

    ' We start outside a word.
    bInWord = False

    ' Iterate through the entire array, including the last element which corresponds to no character.
    For lPos = 1 To lTextLen + 1

        If (aiCharType(lPos) And C1_LOWER) = C1_LOWER Then
        ' Lower case characters.
            If Not bInWord Then
                bFirstCharUCase = False
                lPosStartWord = lPos
                bInWord = True
            End If
        ElseIf (aiCharType(lPos) And C1_UPPER) = C1_UPPER Then
        ' Upper case characters.
            If bInWord Then
            ' If we are already in the word, i.e. past the first character, then we know that the character *should* be lower case.
                Mid$(ProperCaseWords, lPos, 1) = LCase$(Mid$(ProperCaseWords, lPos, 1))
            Else
                bFirstCharUCase = True
                lPosStartWord = lPos
                bInWord = True
            End If
        Else
        ' Non lower or upper case characters. Also includes last (zero) element.
            If bInWord Then
            ' If we are in a word, and the latest character is non-alphabetical, then we now check what word it is, and
            ' decide whether to make the first character upper or lower case.
                bInWord = False

                ' Retrieve the word from the string, and deliberately make the first character lower case.
                ' Note that all other characters in the word would have already been made lower case.
                sWord = Mid$(ProperCaseWords, lPosStartWord, lPos - lPosStartWord)
                If bFirstCharUCase Then
                    Mid$(sWord, 1, 1) = LCase$(Mid$(sWord, 1, 1))
                End If

                ' Compare our word against a lower-case word list.
                Select Case sWord
                Case "in", "on", "an", "to", "and", "the", "with", "that", "is" ' <=== CUSTOM LIST OF WORDS
                    If bFirstCharUCase Then
                        Mid$(ProperCaseWords, lPosStartWord, 1) = LCase$(Mid$(ProperCaseWords, lPosStartWord, 1))
                    End If
                Case Else
                    If Not bFirstCharUCase Then
                        Mid$(ProperCaseWords, lPosStartWord, 1) = UCase$(Mid$(ProperCaseWords, lPosStartWord, 1))
                    End If
                End Select
            End If
        End If

    Next lPos

End Function

Upvotes: 0

MC ND
MC ND

Reputation: 70941

Option Explicit 

Dim originalString    
    originalString = "a saMple of String capiTalization (in some cases, not so obvious)"

Dim convertedString

Dim noiseWords
    noiseWords= "/a/abaft/aboard/about/above/absent/across/afore/after/against/along/alongside/amid" + _ 
                "/amidst/among/amongst/an/anenst/apropos/apud/around/as/aside/astride/at/athwart/atop" + _ 
                "/barring/before/behind/below/beneath/beside/besides/between/beyond/but/by/circa" + _ 
                "/concerning/despite/down/during/except/excluding/failing/following/for/forenenst/from" + _ 
                "/given/in/including/inside/into/like/mid/midst/minus/modulo/near/next/notwithstanding" + _ 
                "/o/of/off/on/onto/opposite/or/out/outside/over/pace/past/per/plus/pro/qua/regarding" + _ 
                "/round/sans/save/since/so/than/through/thru/throughout/thruout/till/times/to/toward" + _ 
                "/towards/under/underneath/unlike/until/unto/up/upon/versus/vs/via/vice/vis/with/within" + _ 
                "/without/worth/this/"

    Function correctCase(matchString,word,position,sourceString)
        word = LCase(word)
        If (position > 0) And (InStr(noiseWords,"/" & word & "/")>0) Then 
            correctCase = word
        Else
            correctCase = UCase(Left(word,1)) & Mid(word,2,Len(word)-1)
        End If
    End Function 

    With New RegExp
        .Pattern = "(\w+)"
        .Global = True 
        .IgnoreCase = True
        convertedString = .Replace(originalString,GetRef("correctCase"))
    End With 

    WScript.Echo originalString
    WScript.Echo convertedString

The basic idea is to use a regular expression matching any sequence of "word" characters ([a-zA-Z0-9]) and for each sequence, a function is called which receives as parameters the string matches, the capture group containing the word, the position in the string where it has been found and the full source string.

If the word is at position 0 it is capitalized. If the word is a "noise" word, it is lowercased, else, the word is capitalized.

Upvotes: 1

Related Questions