Chris

Reputation: 391

Why is this negative lookbehind wrong?

def get_hashtags(post)
    tags = []
    post.scan(/(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/){|x,y| tags << y}
    tags
end

Test.assert_equals(get_hashtags("two hashs##in middle of word#"), [])
#Expected: [], instead got: ["in"]

Shouldn't the lookbehind check that the match isn't preceded by a letter or digit? Why is it still accepting 'in' as a valid match?

Upvotes: 1

Views: 101

Answers (1)

Cary Swoveland

Reputation: 110675

You should use \K rather than a negative lookbehind. That allows you to simplify your regex considerably: no need for a pre-defined array, capture groups or a block.

\K means "discard everything matched so far": the text before it must still be present, but it is dropped from the reported match. The key here is that a variable-length pattern such as #+ can precede \K, whereas in Ruby (and many other regex engines) variable-length patterns are not permitted in positive or negative lookbehinds.
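As a small illustration of what \K does (the sample string and variable names here are made up for the example, not taken from the question), compare the same pattern with and without it:

```ruby
# Hypothetical sample string, used only to show the effect of \K.
sample = "a ###tag"

# Without \K the pound signs are part of the reported match.
without_k = sample[/#+[a-z]+/]

# With \K the pound signs must still be present, but they are
# dropped from the reported match.
with_k = sample[/#+\K[a-z]+/]

puts without_k  # prints "###tag"
puts with_k     # prints "tag"
```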

r = /
    [^0-9a-zA-Z#] # match one character that is not a letter, digit or pound sign
    \#+           # match one or more pound signs
    \K            # discard everything matched so far
    [a-zA-Z]+     # match one or more letters
    /x            # extended mode

Note that the # in \#+ would not need to be escaped if I weren't writing the regex in extended mode, where an unescaped # starts a comment.

"two hashs##in middle of word#".scan r
  #=> []

"two hashs&#in middle of word#".scan r
  #=> ["in"]

"two hashs#in middle of word&#abc of another word.###def ".scan r
  #=> ["abc", "def"]
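To see why the pattern in the question returns "in", it may help to look at where its match actually starts. The sketch below is an observation added for illustration, not part of the answer above. It assumes the intent is "pound signs not preceded by a letter, digit or another pound sign". The names original, fixed and consuming are hypothetical, and consuming is just a non-extended spelling of r.

```ruby
str = "two hashs##in middle of word#"

# The question's pattern. The lookbehind only inspects the one character
# before the position where the match starts. Starting at the first #
# fails (preceded by "s"), so the engine retries at the second #, which
# is preceded by "#", and "#" is not in [0-9a-zA-Z].
original = /(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/
m = str.match(original)
match_text  = m[0]        # "#in", only the second pound sign
match_start = m.begin(0)  # 10, the index of the second pound sign

# Assumed alternative: keep the lookbehind but add # to its class, so a
# match cannot start in the middle of a run of pound signs.
fixed = /(?<![0-9a-zA-Z#])#+\K[a-zA-Z]+/
no_tags   = str.scan(fixed)               # []
with_tags = "#ruby is ##fun".scan(fixed)  # ["ruby", "fun"]

# Non-extended spelling of r. It has to consume a preceding character,
# so a tag at the very start of the string is not matched.
consuming = /[^0-9a-zA-Z#]#+\K[a-zA-Z]+/
start_missed = "#ruby is ##fun".scan(consuming)  # ["fun"]
```

If tags at the very start of a post should count, the lookbehind variant may be the better fit; otherwise both patterns give the same results on the strings tested above.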

Upvotes: 2
