Reputation: 547

match tickers using regular expression

I need to extract tickers (abbreviated stock symbols) from tweets. Tickers start with $ (dollar sign) and consist of uppercase letters and sometimes "-". Here is an example:

str = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

I tried many regexes, but none of them returned what I need:

\b\$.*\b
[$].*\s     
[$].*\b
[$].*\s$

I need to match:

$SPCE 
$STPK 
$VG-AC

Upvotes: 2

Answers (4)

The fourth bird

Reputation: 163457

You can match 1 or more uppercase chars A-Z.

Then optionally repeat matching - and 1 or more uppercase chars A-Z.

\$[A-Z]+(?:-[A-Z]+)*\b

Explanation

\$[A-Z]+ Match $ and 1 or more uppercase chars A-Z
(?: Non capture group
-[A-Z]+ Match - and 1 or more uppercase chars A-Z
)* Close group and repeat 0+ times
\b A word boundary


For example

import re
 
regex = r"\$[A-Z]+(?:-[A-Z]+)*\b"
s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
print(re.findall(regex, s))

Output

['$SPCE', '$STPK', '$VG-AC']
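
To see why the attempts in the question fall short, here is a small sketch that runs two of them next to the pattern above on the same sample string. The variable names (greedy, boundary, tickers) are only illustrative.

```python
import re

s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

# Greedy ".*" runs to the end of the string, then backtracks to the last whitespace,
# so everything from the first "$" onward ends up in a single match.
greedy = re.findall(r"[$].*\s", s)

# "\b" needs a word character on one side; a space followed by "$" has none,
# so the leading "\b" never succeeds in this string.
boundary = re.findall(r"\b\$.*\b", s)

# Restricting what may follow "$" keeps each match to a single ticker.
tickers = re.findall(r"\$[A-Z]+(?:-[A-Z]+)*\b", s)

print(greedy)    # ['$SPCE $STPK $VG-AC price is ']
print(boundary)  # []
print(tickers)   # ['$SPCE', '$STPK', '$VG-AC']
```

The point is that .* places no limit on what follows the dollar sign, while [A-Z]+ stops at the first character that cannot belong to a ticker.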

Upvotes: 0

hd1

Reputation: 34677

pytickersymbols, if it does what it says on the tin, should serve your purpose well. From the tests:

import yfinance as yf
y_ticker = yf.Ticker('GOOG')
data = y_ticker.history(period='4d')

Upvotes: 1

Ryszard Czech

Reputation: 18621

Use

re.findall(r'\$(?!\d+\.\d)\S+', text)

See proof.

Explanation

--------------------------------------------------------------------------------
  \$                       '$'
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  \S+                      non-whitespace (all but \n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
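
As a quick check, here is a minimal sketch that applies this pattern to the sample string from the question. The second input (extra) is a made-up example, added only to show where the lookahead does and does not help.

```python
import re

pattern = re.compile(r"\$(?!\d+\.\d)\S+")

text = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
sample_matches = pattern.findall(text)
print(sample_matches)  # ['$SPCE', '$STPK', '$VG-AC']

# Hypothetical input: the lookahead only rejects amounts with a decimal point,
# and \S+ also keeps trailing punctuation.
extra = "paid $5 for $TSLA, not $1.50"
extra_matches = pattern.findall(extra)
print(extra_matches)  # ['$5', '$TSLA,']
```

If whole-dollar amounts or trailing punctuation can appear in the tweets, a stricter character class after the $ may be a better fit than \S+.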

Upvotes: 0

CyDevos

Reputation: 411

I would suggest something like this: re.findall(r'\$[A-Z-]+', text)

\$ = starts with $

[A-Z-]+ = matches uppercase letters, with the dash as a possibility; the + at the end allows repetition. A ? is not needed inside the brackets, where it would be matched as a literal question mark.

This regex works even with a pattern like $ABS-DE-CE
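
A quick check of the character-class approach, as a sketch on a made-up input string. It also shows that a ? placed inside the brackets is matched literally, so the class behaves differently with and without it.

```python
import re

# Hypothetical input, chosen to include a multi-dash ticker and a trailing "?".
text = "Watching $ABS-DE-CE and $VG-AC? price is $0.88"

# With "?" inside the class, a literal question mark becomes part of the match.
with_qmark = re.findall(r"\$[A-Z-?]+", text)

# Without it, only uppercase letters and dashes are accepted.
without_qmark = re.findall(r"\$[A-Z-]+", text)

print(with_qmark)     # ['$ABS-DE-CE', '$VG-AC?']
print(without_qmark)  # ['$ABS-DE-CE', '$VG-AC']
```

Either way, $0.88 is skipped because a digit is not in the class. Note that this class would also accept a trailing dash, as in $VG-.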

Upvotes: 0
