Reputation: 5455
I am trying to create a regular expression in Python that matches #hashtags. My definition of a hashtag is:
#
[ ,\.]
So in this text
#This string cont#ains #four, and #only four #hashtags.
The hashtags here are This, four, only and hashtags.
The problem I have is the optional check for the beginning of the line.
[ \.,]+ won't do it since it won't match the optional beginning.
[ \.,]? won't do it since it matches too much.
Example with +
In []: re.findall(r'[ \.,]+#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['four', 'only', 'hashtags']
Example with ?
In []: re.findall(r'[ \.,]?#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['This', 'ains', 'four', 'only', 'hashtags']
How can I optionally match the beginning of the line?
Upvotes: 0
Views: 383
Reputation: 5452
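Instead of listing what may come before the #, you can state what must not come before it. A negative lookbehind does that without consuming any characters:
(?<!\w)(#[^ \.,]+)
Here (?<!\w) only allows a match when the # is not preceded by a word character, which also holds at the very beginning of the string.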
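As a rough check of the lookbehind above, here is a small sketch run against the sample text from the question. The variable names are made up for illustration, and the second and third patterns are variants I am assuming might be useful (capturing only the tag name, and an explicit "start of string or separator" alternation). They are not part of the pattern given above.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# Pattern as given above: the capture group includes the leading '#'.
with_hash = re.findall(r'(?<!\w)(#[^ \.,]+)', text)

# Variant (assumption): move the group so only the tag name is captured.
names_only = re.findall(r'(?<!\w)#([^ \.,]+)', text)

# Alternative (assumption): spell out "start of string OR one separator"
# with a non-capturing group instead of a lookbehind.
alternation = re.findall(r'(?:^|[ \.,])#([^ \.,]+)', text)

print(with_hash)    # ['#This', '#four', '#only', '#hashtags']
print(names_only)   # ['This', 'four', 'only', 'hashtags']
print(alternation)  # ['This', 'four', 'only', 'hashtags']
```

Note that the two approaches are not strictly equivalent: the lookbehind accepts a # after any non-word character (for example an opening parenthesis), while the alternation only accepts the separators listed in the character class.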
Upvotes: 0
Reputation: 298196
This seems to work:
>>> re.findall(r'\B#([^,\W]+)', '#This string cont#ains #four, and #only four #hashtags.')
['This', 'four', 'only', 'hashtags']
\B: Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.
\W: When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
Upvotes: 3
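To make the \B explanation concrete, here is a minimal sketch on the same sample text. Since # is itself a non-word character, \B in front of it can only succeed when the previous character is also a non-word character or when there is no previous character, which is why cont#ains is skipped. The simplified pattern is my own assumption about an equivalent form, not something stated in the answer.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# Pattern from the answer: \B rejects '#' when it directly follows a word character.
tags = re.findall(r'\B#([^,\W]+)', text)

# [^,\W] means "not a comma and not a non-word character". A comma is already
# a non-word character, so this should be the same set as \w (assumption:
# default flags, str pattern).
tags_simplified = re.findall(r'\B#(\w+)', text)

print(tags)             # ['This', 'four', 'only', 'hashtags']
print(tags_simplified)  # ['This', 'four', 'only', 'hashtags']
```

Because the tag body is limited to word characters here, a tag such as #foo-bar would be cut at the hyphen, unlike the [^ \.,]+ class used in the question.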