Reputation: 9019
I am trying to extract some substrings from another string. I have identified patterns that should yield the correct results, but I think there are some small flaws in my implementation.
s = 'Arkansas BaseballMiami (Ohio) at ArkansasFeb 17, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio) starters: 1/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas starters: 8/dh E. Cole; 9/ss J. Biggers; 17/lf L. Bonfield; 33/c G. Koch; 28/cf D. Fletcher; 20/2b C. Shaddy; 24/1b C Spanberger; 15/rf J. Arledge; 6/3b H. Wilson; 16/p B. Knight;Miami (Ohio) 1st - HALL, D. struck out swinging.'
Here are the regular expressions I tried in order to get my desired outputs:
import re
teams = re.findall(r'(;|[0-9])(.*?) starters', s)
pitchers = re.findall(r'/p(.*?);', s)
The pitchers search seems to work, but the teams search outputs the following:
[('1', '7, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio)'), ('1', '/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas')]
DESIRED OUTPUTS:
['Miami (Ohio)', 'Arkansas']
[' GNETZ', ' B. Knight']
I can worry about stripping out the leading spaces in the pitchers' names later.
Upvotes: 0
Views: 45
Reputation: 36013
(;|[0-9]) can be replaced with [;0-9]. Then what I think you're trying to express is "get me the string before starters and immediately after the last number/semicolon that comes before the starters", for which you can say "there must be no other numbers/semicolons in between", i.e.
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)
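To see why this helps: the regex engine takes the leftmost position where a match can start, and the lazy .*? only controls where the match ends, so the original pattern starts at the first digit it finds and runs forward to the word starters. Here is a minimal sketch of the corrected pattern on a shortened copy of the string from the question (the shortening is mine, for readability). The \s* in the pitchers pattern is an optional addition, not part of the original, that drops the leading space. Note that this approach assumes team names contain no digits or semicolons.

```python
import re

# Shortened copy of the string from the question (trimmed for readability).
s = ('Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781'
     'Miami (Ohio) starters: 1/lf HALL, D.; 32/p GNETZ;'
     'Arkansas starters: 8/dh E. Cole; 16/p B. Knight;'
     'Miami (Ohio) 1st - HALL, D. struck out swinging.')

# A digit or semicolon, then a run containing neither, then " starters".
# Excluding digits/semicolons from the captured run forces the match to
# begin at the LAST such character before " starters".
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)

# Assumption: \s* is added here to skip the space after "/p";
# the question's pattern kept it and planned to strip it later.
pitchers = re.findall(r'/p\s*(.*?);', s)

print(teams)
print(pitchers)
```

With a single character class there is also only one capture group, so findall returns plain strings rather than tuples.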
Upvotes: 1