Reputation: 33165
I'm trying to replicate Google Calendar's method of creating an appointment from a narrative. I want to enter "5pm Happy Hour for 1 hour" and parse it into, ultimately, an Outlook AppointmentItem.
My problem, I think, is that I have a large chunk of optional text at the end. Because it's optional, the regex passes, but the submatch doesn't get populated since it isn't required for the match. I want it populated because I want to use the submatches as my parsing engine.
I have a bunch of test cases in column A (working in Excel, then will move to Outlook), and my code lists out the submatches to the right. This is a representative sample of potential input:
1. 5pmCST Happy Hour for 1 hour
2. 5pm CST Happy Hour for 1 hour
3. 5pm Happy Hour for 1 hour
4. 5 pm Happy Hour for 1 hour
5. 5 pm CST Happy Hour for 1 hour
6. 5 Happy Hour for 1 hour
7. 5 Happy Hour
8. 5pmCST Happy Hour
9. 5pm CST Happy Hour
10. 5pm Happy Hour
11. 5:00CST Happy Hour for 1 hour
12. 5:00 CST Happy Hour for 1 hour
Here's the code that runs the tests
Sub testest()
    Dim RegEx As VBScript_RegExp_55.RegExp
    Dim Matches As VBScript_RegExp_55.MatchCollection
    Dim Match As VBScript_RegExp_55.Match
    Dim rCell As Range
    Dim SubMatch As Variant
    Dim lCnt As Long
    Dim aPattern(1 To 8) As String
    Set RegEx = New VBScript_RegExp_55.RegExp
    aPattern(1) = "(1?[0-9](:[0-5][0-9])?)" 'time
    aPattern(2) = "( ?)" 'optional space
    aPattern(3) = "([ap]m)?" 'optional ampm
    aPattern(4) = "( ?)" 'optional space
    aPattern(5) = "([ECMP][DS]T)?" 'optional time zone
    aPattern(6) = "( ?)" 'optional space
    aPattern(7) = "(.+?)" 'event description
    aPattern(8) = "(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?" 'optional duration
    RegEx.Pattern = Join(aPattern, vbNullString)
    Debug.Print RegEx.Pattern
    Sheet1.Range("C1").Resize(1000, 100).ClearContents
    For Each rCell In Sheet1.Range("A1").CurrentRegion.Columns(1).Cells
        lCnt = 0
        rCell.Offset(0, 2).Value = RegEx.test(rCell.Text)
        If RegEx.test(rCell.Text) Then
            Set Matches = RegEx.Execute(rCell.Text)
            For Each Match In Matches
                For Each SubMatch In Match.SubMatches
                    lCnt = lCnt + 1
                    rCell.Offset(0, 2 + lCnt).Value = SubMatch
                Next SubMatch
            Next Match
        End If
    Next rCell
End Sub
The pattern is
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?
The submatches for #1 are
1 2 3 4 5 6 7
5 pm CST H
It stops matching at the "H" in Happy Hour because the lazy (.+?) takes as little as it can and everything starting with the " for " is optional, so the match succeeds after a single character. If I remove the optional part, my pattern becomes
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?)
But #7-#10 don't pass because they don't have a duration. The submatches for #1 give me what I want though
1 2 3 4 5 6 7 8 9 10 11
5 pm CST Happy Hour for 1 hour
I want every possible submatch to fill even if VBScript doesn't need it in order to make the regex pass. I fear this is just how it works and that I'm trying to get regex to do my parsing work for me. I considered running it through increasingly restrictive patterns until one doesn't pass, then using the last passing pattern, but that seems kludgy.
Is it possible to get regex to fill those submatches?
Upvotes: 3
Views: 1645
Reputation: 60354
I have assumed each line is the entire contents of a single cell, so I am able to use anchors. I also don't think you need as many capturing groups as you have. I set up the regex with:
Group 1 Time
Group 2 am/pm
Group 3 Time Zone
Group 4 Description
Group 5 Hours (and fractions of hours)
With your data in A2:An, the following routine parses the data into the adjacent columns. It doesn't matter if a Submatch is "not filled". You could also fill elements in an array, or whatever else you want to do. If you want more submatches, you can always either add capturing groups for the optional spaces, or change the relevant non-capturing groups to capturing groups.
Also, since the "for" is optional, I chose to use a lookahead to determine the end of "description". Description will end with either a \s+for\s+ sequence; or with the "end of line". Since I have assumed there is only one entry, and one line, per cell, the multiline and global properties are irrelevant.
One has to include spaces before and after "for" so as to avoid problems if that sequence is included in Description.
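To see the effect of the lookahead in isolation, here is a minimal sketch that looks only at the description group. The helper name FirstSubMatch and the demo sub are hypothetical, and late binding via CreateObject is an assumption made so the snippet runs without setting a reference.

```vba
'Hypothetical helper: returns the first submatch of Pattern in Text, or "" if there is no match
Function FirstSubMatch(ByVal Pattern As String, ByVal Text As String) As String
    Dim RE As Object, MC As Object
    Set RE = CreateObject("VBScript.RegExp") 'late binding, no reference needed
    RE.IgnoreCase = True
    RE.Pattern = Pattern
    If RE.Test(Text) Then
        Set MC = RE.Execute(Text)
        'Appending an empty string coerces an unfilled submatch to ""
        FirstSubMatch = MC(0).SubMatches(0) & vbNullString
    End If
End Function

Sub DemoLookahead()
    'Lazy group followed only by an optional group: stops after one character
    Debug.Print FirstSubMatch("(.+?)(?: for \d+ hours?)?", "Happy Hour for 1 hour") 'H
    'The lookahead makes the lazy group run up to " for " or the end of the text
    Debug.Print FirstSubMatch("(.*?(?=\s+for\s+|$))", "Happy Hour for 1 hour") 'Happy Hour
    Debug.Print FirstSubMatch("(.*?(?=\s+for\s+|$))", "Happy Hour") 'Happy Hour
    'Requiring whitespace around "for" means "Forum" is not treated as the delimiter
    Debug.Print FirstSubMatch("(.*?(?=\s+for\s+|$))", "Forum night for 2 hours") 'Forum night
End Sub
```

Run DemoLookahead from the Immediate window to compare the lazy-only pattern with the lookahead version.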
Option Explicit
'set Reference to Microsoft VBScript Regular Expressions 5.5
Sub ParseAppt()
    Dim R As Range, C As Range
    Dim RE As RegExp, MC As MatchCollection
    Dim I As Long
    Set R = Range("a2", Cells(Rows.Count, "A").End(xlUp))
    Set RE = New RegExp
    With RE
        .Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"
        .IgnoreCase = True
        For Each C In R
            If .Test(C.Text) = True Then
                Set MC = .Execute(C.Text)
                'Write the five submatches to the five columns to the right
                For I = 0 To 4
                    C.Offset(0, I + 1) = MC(0).SubMatches(I)
                Next I
            End If
        Next C
    End With
End Sub
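If you would rather fill an array than worksheet cells, the same five groups can be wrapped in a function. This is only a sketch under a few assumptions: ParseNarrative and DemoParseNarrative are hypothetical names, late binding is used so no reference is required, the time-zone class is written as [ECMP] as in the question's pattern, and converting the hours group to minutes with Val is just an illustration of what you might do next.

```vba
'Hypothetical function: returns a 5-element String array
'(0) time, (1) am/pm, (2) time zone, (3) description, (4) hours
Function ParseNarrative(ByVal Narrative As String) As Variant
    Dim RE As Object, MC As Object
    Dim Fields(0 To 4) As String
    Dim I As Long

    Set RE = CreateObject("VBScript.RegExp") 'late binding, no reference needed
    RE.IgnoreCase = True
    RE.Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*" & _
                 "(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"

    If RE.Test(Narrative) Then
        Set MC = RE.Execute(Narrative)
        For I = 0 To 4
            'Appending an empty string coerces an unfilled submatch to ""
            Fields(I) = MC(0).SubMatches(I) & vbNullString
        Next I
    End If
    ParseNarrative = Fields 'all elements stay "" when nothing matched
End Function

Sub DemoParseNarrative()
    Dim F As Variant
    F = ParseNarrative("5pmCST Happy Hour for 1.5 hours")
    Debug.Print Join(F, "|") '5|pm|CST|Happy Hour|1.5
    Debug.Print Val(F(4)) * 60 '90, the duration in minutes
    F = ParseNarrative("5 Happy Hour")
    Debug.Print Join(F, "|") '5|||Happy Hour|
End Sub
```

The unfilled groups simply come back as empty strings, so the caller can decide on defaults (for example, a default duration when element 4 is empty).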
Upvotes: 2