Reputation: 547
I need to extract tickers (abbreviated stock symbols) from tweets. The tickers start with $ (dollar sign) and are composed of uppercase letters and sometimes "-". An example is below:
str = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
I tried many regexes, but none of them returned what I need:
\b\$.*\b
[$].*\s
[$].*\b
[$].*\s$
I need to match:
$SPCE
$STPK
$VG-AC
Upvotes: 2
Views: 755
Reputation: 163457
You can match $ followed by 1 or more uppercase chars A-Z, then optionally repeat matching - and 1 or more uppercase chars A-Z.
\$[A-Z]+(?:-[A-Z]+)*\b
Explanation
\$[A-Z]+ Match $ and 1 or more uppercase chars A-Z
(?: Non capture group
-[A-Z]+ Match - and 1 or more uppercase chars A-Z
)* Close group and repeat 0+ times
\b A word boundary
For example
import re
regex = r"\$[A-Z]+(?:-[A-Z]+)*\b"
s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
print(re.findall(regex, s))
Output
['$SPCE', '$STPK', '$VG-AC']
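As a rough sketch of how the same pattern behaves on a few edge cases, the snippet below wraps it in a small helper. The helper name `extract_tickers`, the constant `TICKER_RE`, and the sample tweets are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer; TICKER_RE is a hypothetical name.
TICKER_RE = re.compile(r"\$[A-Z]+(?:-[A-Z]+)*\b")

def extract_tickers(text):
    """Return every $TICKER found in text, in order of appearance."""
    return TICKER_RE.findall(text)

# Invented sample tweets covering punctuation, prices, a trailing dash
# and lowercase symbols.
tweets = [
    "Long $SPCE, short $STPK.",
    "Price is $0.88, not $1",
    "Watching $VG-AC- closely",
    "lowercase $spce is ignored",
]

results = [extract_tickers(t) for t in tweets]
for tweet, found in zip(tweets, results):
    print(found, "<-", tweet)
```

Trailing punctuation and a dangling dash are left out of the match because each dash must be followed by at least one uppercase letter, and prices are skipped because a digit cannot follow the $.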
Upvotes: 0
Reputation: 34677
pytickersymbols, if it does what it says on the tin, should serve your purpose well. From the tests:
import yfinance as yf
y_ticker = yf.Ticker('GOOG')
data = y_ticker.history(period='4d')
Upvotes: 1
Reputation: 18621
Use
re.findall(r'\$(?!\d+\.\d)\S+', text)
Explanation
\$      '$'
(?!     look ahead to see if there is not:
  \d+   digits (0-9) (1 or more times (matching the most amount possible))
  \.    '.'
  \d    digits (0-9)
)       end of look-ahead
\S+     non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
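A short check of this pattern, under the assumption that tweets may contain punctuation and whole-dollar amounts (the extra sample strings and the `rstrip` cleanup are illustrative additions, not part of the answer). Because `\S+` accepts any non-whitespace character, the match can include trailing punctuation, and the lookahead only rejects amounts that contain a decimal point.

```python
import re

pattern = re.compile(r"\$(?!\d+\.\d)\S+")

# The sample from the question: the decimal price is skipped.
sample = "High $SPCE $STPK $VG-AC price is $0.88"
matches_sample = pattern.findall(sample)

# \S+ also swallows punctuation attached to the symbol.
matches_punct = pattern.findall("Buy $SPCE, sell $STPK!")

# A whole-dollar amount has no decimal point, so the lookahead lets it through.
matches_amount = pattern.findall("costs $5 today")

# One possible cleanup (an assumption): strip common trailing punctuation.
cleaned = [m.rstrip(".,;:!?") for m in matches_punct]

print(matches_sample)
print(matches_punct)
print(matches_amount)
print(cleaned)
```

If such inputs are expected, a stricter character class restricted to uppercase letters and dashes may be a safer choice.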
Upvotes: 0
Reputation: 411
I would suggest something like this: re.findall(r'\$[A-Z-]+', text)
\$ = start with $
[A-Z-]+ = match uppercase letters, with the dash allowed as well. The + at the end repeats the class one or more times.
This regex also works with a pattern such as $ABS-DE-CE.
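A minimal sketch of how a character class that includes the dash behaves, with the class written here as `[A-Z-]`. The sample strings are invented for illustration. Two things are worth knowing: the class does not require a letter after a dash, and a `?` placed inside square brackets is a literal question mark rather than a quantifier.

```python
import re

class_pattern = re.compile(r"\$[A-Z-]+")

# Multi-dash symbols are matched; the price is skipped because
# digits are not in the class.
multi = class_pattern.findall("$ABS-DE-CE and $VG-AC at $0.88")

# The class accepts a dash anywhere, so a trailing or lone dash also matches.
trailing = class_pattern.findall("$VG- and $-")

# Inside brackets, ? is a literal character, so it becomes part of the match.
literal_q = re.findall(r"\$[A-Z-?]+", "$WHY?")

print(multi)
print(trailing)
print(literal_q)
```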
Upvotes: 0