Reputation: 167
Problem Introduction
So I've fried my brain trying to get negative lookaheads/lookbehinds to work. For the last example input, my current solution returns no match (see the expected output table). I'm struggling with how to match the `title` part of the string when it includes a year that is not at the end of the string. To be clear, I'm only interested in matching the year if it is at the end of the string. The current regex fails on the last example, as it is matching NOT("Q" OR "\d*") in the `title`. However, I only want it to match NOT("Q" AND "\d{1}"). Any tips/suggestions greatly appreciated. Note: I'm using Python 3.8.
Example Input
AXP - Earnings call Q2 2021
AXP - Conference call 2021
BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021
GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference
AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference
The `period` will always be of the form `Q[1-4]`. `period` and `year` are optional. If they do occur, they will be at the end of the string. `symbol` and `title` are always separated by ` - ` and always occur.
Expected Output
symbol | title | period | year |
---|---|---|---|
AXP | Earnings call | Q2 | 2021 |
AXP | Conference call | | 2021 |
BAC | Earnings call | Q2 | 2021 |
GM | General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference | | |
AXP | American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference | | |
What I've Tried
r"^(?P<symbol>[^\,]{1,8})(\,[A-Z\.]+)*\s\-\s(?P<title>[^Q\d]*)\s?(?P<period>Q\d)?\s?(?P<year>19|20\d{2})$"
Upvotes: 1
Views: 119
Reputation: 626747
You can use
^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+(?P<title>(?:(?!Q\d).)*?)\s*(?P<period>Q\d)?\s?(?P<year>(?:19|20)\d{2})?$
See the regex demo.
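As a quick check, the pattern can be run against the example inputs from the question. This is only a minimal sketch: the names `PATTERN`, `SAMPLES` and `parse` are made up for illustration, and it assumes each input is a single line handled on its own.

```python
import re
from typing import Dict, Optional

# The pattern from this answer, split across lines only for readability.
PATTERN = re.compile(
    r"^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+"
    r"(?P<title>(?:(?!Q\d).)*?)\s*"
    r"(?P<period>Q\d)?\s?(?P<year>(?:19|20)\d{2})?$"
)

# Example inputs copied from the question.
SAMPLES = [
    "AXP - Earnings call Q2 2021",
    "AXP - Conference call 2021",
    "BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021",
    "GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference",
    "AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference",
]


def parse(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named groups as a dict, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


for line in SAMPLES:
    # Groups that did not participate in the match come back as None.
    print(parse(line))
```

Optional groups that did not take part in the match (`period`, `year`) are reported as `None`, so they can be mapped to empty cells when building the table.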
Note:
- `[^Q\d]*` is wrong as it matches any zero or more chars other than `Q` and digit; you need to match any text up to a `Q` + digit, that is, a `(?:(?!Q\d).)*?` tempered greedy token
- `(?P<year>19|20\d{2})` is obligatory, but it must be optional, and `19|20` are not grouped, so `\d{2}` is only applied to `20`: `(?P<year>19|20\d{2})` => `(?P<year>(?:19|20)\d{2})?`.

There are other small enhancements here.
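The two points in the note can be seen in isolation with a few small checks. This is an illustrative sketch only; the variable names and the sample strings are chosen here and are not part of the original patterns.

```python
import re

# 1) Alternation binds more loosely than concatenation:
#    19|20\d{2} means "19" OR "20 followed by two digits".
ungrouped = re.compile(r"19|20\d{2}")
grouped = re.compile(r"(?:19|20)\d{2}")

ungrouped_1921 = ungrouped.fullmatch("1921")  # None: "19" alone is not the whole string
ungrouped_19 = ungrouped.fullmatch("19")      # matches: the bare "19" alternative
grouped_1921 = grouped.fullmatch("1921")      # matches: (19|20) followed by two digits

# 2) A negated character class stops at the first Q or digit,
#    while the tempered token only stops where "Q" + digit begins.
text = "Barclays 2020 Conference Q2 2021"
by_class = re.match(r"[^Q\d]*", text).group()
by_tempered = re.match(r"(?:(?!Q\d).)*", text).group()

print(repr(by_class))     # text up to the first digit
print(repr(by_tempered))  # text up to "Q2"
```

The character class gives up at the `2` of `2020`, which is why the original pattern could not match titles containing a year.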
Details:
- `^` - start of string
- `(?P<symbol>[^,]{1,8})` - Group "symbol": one to eight chars other than a comma
- `(?:,[A-Z.]*)*` - zero or more repetitions of a comma and then zero or more uppercase letters/dots
- `\s+-\s+` - a hyphen enclosed with one or more whitespaces
- `(?P<title>(?:(?!Q\d).)*?)` - Group "title": any char other than a line break char, zero or more but as few as possible occurrences, that does not start a `Q`+digit char sequence
- `\s*` - zero or more whitespaces
- `(?P<period>Q\d)?` - Group "period": a `Q` and a digit
- `\s?` - an optional whitespace
- `(?P<year>(?:19|20)\d{2})?` - an optional Group "year": `19` or `20` and then two digits
- `$` - end of string.

Upvotes: 1