WoJ

Reputation: 29987

How to match my specific hashtag pattern?

My problem:

I have a string which may include hashtags (see below for the definition), as well as the sequence \n. That is the two characters \ and n, which represent a newline but are not a control character (equivalent to the <br> sequence in HTML).

I would like to retrieve the hashtags. I am using Python, but the question is focused on regex; if there is a better solution in Python I would be delighted. I did not add the python tag because it might narrow the scope of the question too much.

A hashtag is defined as:

My solution which almost works (I apologise in advance, my regex skills are almost non-existent so this may be a very bad approach):

#[\w\d][\w\-]*

Please see my attempt on regex101, based on the pattern above and the test set below:

#hashtag some text #hash; #123
 and # not because markdown
# that not
#33 that is not 
either but #3isok 
 or #isok3
astring#andthatshouldnotmatch 
 #hashtagalone
 \n#hashatthebeginning
hello #hashattheend\n
#has_htag 
#ano-the-rone

My concerns:

About line 7 of the test set (astring#andthatshouldnotmatch): I did not add the optional leading whitespace to the pattern above because whatever I tried broke the rest. I thought that merely adding \s* would be enough, but then I started to match ends of lines and other unwanted text. I could have made the pattern require "a whitespace or the sequence \n" before the #, but I do not know how to write an OR when one of the alternatives is more than one character long.

Ultimately, if this whitespace at the beginning is a problem then never mind: I will just need to be careful not to glue my hashtags to other text :)

Upvotes: 0

Views: 670

Answers (2)

CertainPerformance

Reputation: 370759

First, to start the pattern:

MAY be prefixed by a whitespace or the sequence \n

So, the pattern needs to either start at the beginning of the string, or the character right before it needs to be whitespace, or the two characters right before it need to be the literal sequence \n. You can alternate between these three possibilities like so:

(?:^|(?<=\s)|(?<=\\n))

(You can't alternate between \s and \\n inside a single lookbehind, because the two alternatives have different lengths and that would make the lookbehind non-fixed width; lookbehinds must be fixed width in almost all flavors, including Python's re module.)
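As a small illustration of that restriction in Python's built-in re module, the sketch below tries to compile both forms. The helper name `compiles` and the variable names are only illustrative, not part of the answer.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One lookbehind whose alternatives are 1 and 2 characters wide: rejected.
variable_width = r"(?<=\s|\\n)#"

# Two separate lookbehinds, each with a fixed width: accepted.
fixed_width = r"(?:(?<=\s)|(?<=\\n))#"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```

Splitting the alternation into one lookbehind per width is what keeps the pattern portable across flavors with this restriction.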

the next character MUST be either a letter or a digit

If you want only letters and digits to come right after the #, then don't use \w, because \w also matches _. Use a character set instead:

[a-z\d]      # plus case-insensitive flag
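A quick way to see the difference in Python (a minimal sketch; the variable names are only illustrative):

```python
import re

# \w accepts the underscore, so it is too permissive for the first character.
w_matches_underscore = re.fullmatch(r"\w", "_") is not None

# An explicit character set with the case-insensitive flag rejects it,
# while still accepting upper-case letters.
first_char = re.compile(r"[a-z\d]", re.IGNORECASE)
set_matches_underscore = first_char.fullmatch("_") is not None
set_matches_upper = first_char.fullmatch("H") is not None

print(w_matches_underscore, set_matches_underscore, set_matches_upper)
# True False True
```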

the next character MUST be a letter, the sign - or the sign _

Same sort of thing - just put the characters you want into a character set:

[a-z_-]

the following characters MAY be letters, digits, - or _ (0 or more)

[a-z\d_-]*

Put it together, and you get:

(?:^|(?<=\s)|(?<=\\n))#[a-z\d][a-z_-][a-z\d_-]*

https://regex101.com/r/doiLYw/4
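For use from Python, the assembled pattern can be passed to re.findall with the case-insensitive flag. This is only a sketch: the sample string is a shortened, hypothetical variant of the test set in the question, written as a raw string so that \n stays the two characters \ and n.

```python
import re

# The pattern from above, compiled with the case-insensitive flag.
HASHTAG = re.compile(
    r"(?:^|(?<=\s)|(?<=\\n))#[a-z\d][a-z_-][a-z\d_-]*",
    re.IGNORECASE,
)

# Raw string: "\n" below is a backslash followed by "n", not a newline.
sample = (
    r"#hashtag some text #hash; #123 astring#glued "
    r"\n#afterliteral #3isok #33 #has_htag"
)

tags = HASHTAG.findall(sample)
print(tags)
# ['#hashtag', '#hash', '#afterliteral', '#3isok', '#has_htag']
```

#123 and #33 are skipped because the second character must be a letter, - or _, and #glued is skipped because it is glued to the preceding word. If the text contains real newline characters instead of the literal sequence, the (?<=\s) branch already covers them, since a newline counts as whitespace.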

Upvotes: 2

Blindy

Reputation: 67380

(?:\s|^|\\n)#(\w[-a-zA-Z_][-\w_]*)

I made one modification to your specification: the last rule only allows letters or digits past the 2nd character, but the very last tag in your example looks valid to me, so I allowed - and _ as well.

Online test
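A short sketch of how this pattern behaves in Python, assuming re.findall and a hypothetical sample string (raw, so \n is the two characters \ and n). Because the pattern has one capturing group, findall returns the tag text without the leading #.

```python
import re

# The pattern from this answer; the prefix is consumed by a
# non-capturing group and the tag itself is captured in group 1.
HASHTAG = re.compile(r"(?:\s|^|\\n)#(\w[-a-zA-Z_][-\w_]*)")

sample = (
    r"#hashtag some text #hash; astring#glued "
    r"\n#afterliteral #ano-the-rone #_ab"
)

names = HASHTAG.findall(sample)
print(names)
# ['hashtag', 'hash', 'afterliteral', 'ano-the-rone', '_ab']
```

Two things to be aware of: the overall match includes the preceding whitespace or \n sequence, so group 1 is the part to use, and since \w also matches _, a tag such as #_ab is accepted here even though its first character is neither a letter nor a digit.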

Upvotes: 1
