xeor

Reputation: 5455

Optional match for beginning of line

I am trying to create a regular expression in Python that matches #hashtags. My definition of a hashtag is:

So in this text

#This string cont#ains #four, and #only four #hashtags.

The hashtags here are This, four, only and hashtags.

The problem I have is making the check at the beginning of the line optional: the first hashtag has no separator character in front of it.

Example with +

In []: re.findall(r'[ \.,]+#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['four', 'only', 'hashtags']

Example with ?

In []: re.findall(r'[ \.,]?#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['This', 'ains', 'four', 'only', 'hashtags']

How can I match either a separator character or the beginning of the line before the #?

Upvotes: 0

Views: 383

Answers (2)

Gabber

Reputation: 5452

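Instead of listing what may come before the #, you can state what must not come before it. A negative lookbehind does that:

(?<!\w)(#[^ \.,]+)

(?<!\w) succeeds only when the character before the # is not a word character, which includes the very start of the string. Note that the group as written captures the # as well.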
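
For illustration, here is a minimal sketch of the lookbehind approach run against the sample text from the question. Two things in it are my own assumptions and not part of the answer above: the capture group is moved so that it excludes the #, to match the output the question asks for, and a second pattern using an explicit "start of line or separator" alternation is shown for comparison.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# Negative lookbehind: the '#' must not be preceded by a word character.
# The assertion also succeeds at position 0, where nothing precedes the '#'.
# (Group moved after the '#' so only the tag name is captured.)
lookbehind = re.compile(r'(?<!\w)#([^ .,]+)')

# Explicit alternation: either the start of the string or one separator.
# For multi-line input, re.MULTILINE would be needed for ^ to match
# at the start of every line.
alternation = re.compile(r'(?:^|[ .,])#([^ .,]+)')

lookbehind_tags = lookbehind.findall(text)
alternation_tags = alternation.findall(text)

print(lookbehind_tags)   # ['This', 'four', 'only', 'hashtags']
print(alternation_tags)  # ['This', 'four', 'only', 'hashtags']
```

The lookbehind consumes no characters, so two hashtags separated by a single space are both found; the alternation consumes the separator, which works here only because the tag itself cannot contain one.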

Upvotes: 0

Blender

Reputation: 298196

This seems to work:

>>> re.findall(r'\B#([^,\W]+)', '#This string cont#ains #four, and #only four #hashtags.')
['This', 'four', 'only', 'hashtags']
  • \B: Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.
  • \W: When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
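
To make the two definitions above concrete, here is a small sketch (the variable names are mine, purely for illustration). It checks the py\B examples from the documentation excerpt and shows that, because a comma is already a non-word character, the class [^,\W] accepts the same characters as \w.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# \B succeeds where there is NO word boundary. Between 't' and '#' in
# 'cont#ains' there is a boundary (word char next to non-word char), so
# that '#' is rejected. At the start of the string and after a space,
# both neighbours of the position are non-word, so \B succeeds.
tags = re.findall(r'\B#([^,\W]+)', text)

# [^,\W] means "not a comma and not a non-word character", i.e. \w.
tags_simplified = re.findall(r'\B#(\w+)', text)

# The documentation's examples for \B.
matches_python = re.search(r'py\B', 'python') is not None
matches_py_dot = re.search(r'py\B', 'py.') is not None

print(tags)             # ['This', 'four', 'only', 'hashtags']
print(tags_simplified)  # ['This', 'four', 'only', 'hashtags']
print(matches_python, matches_py_dot)  # True False
```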

Upvotes: 3
