Scott Buchanan

Reputation: 1223

Regex to capture all but the last hashtag

I normally sling regexes like it's a native language, but I'm stumped by this puzzle today. I need to capture all the text of a string except for the final hashtag. Any hashtags except for the final one should be included, and it also needs to match if there are no hashtags at all.

Test Case 1:

Test Case 2:

Test Case 3:

Because of the environment I'm using this in (Zapier), I have a tight constraint that I need the matching string in a single capturing group with the same group number regardless of the case. Zapier uses the Python engine, FWIW.

The context is automatically posting photos from Instagram to Twitter, where the text has to be limited to 280 characters. Since Zapier's truncate function can't cut on clean word boundaries, the 280-character cutoff could land in the middle of a hashtag, potentially leading to an embarrassing result when Twitter auto-links it. (Zapier's truncate does allow appending an ellipsis, which mitigates the issue for regular words.) Since it's not critical to include every hashtag, I want to throw away the final one, in case it's been truncated.

Upvotes: 1

Views: 236

Answers (3)

user13843220

Reputation:

You can use an unrolled loop method.
This is probably the fastest way to do it.

[^#]*(?:\#(?![^#]*$)[^#]*)*

see https://regex101.com/r/vlEows/1/tests
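
For anyone wanting to try this outside regex101, here is a minimal Python sketch. The pattern itself has no capturing group, so wrapping it in one pair of parentheses is my assumption for meeting the single-group constraint in the question. The function name and the sample captions are made up for illustration.

```python
import re

# Pattern from the answer above, wrapped in ONE capturing group.
# The outer parentheses are an assumption added for the single-group constraint.
UNROLLED = re.compile(r"([^#]*(?:\#(?![^#]*$)[^#]*)*)")

def strip_last_hashtag_unrolled(text: str) -> str:
    """Return everything before the final '#' (hypothetical helper)."""
    # The pattern can match the empty string, so match() never returns None.
    return UNROLLED.match(text).group(1)

# Hypothetical captions, not the test cases from the question.
print(repr(strip_last_hashtag_unrolled("Sunset at the beach #travel #sunset")))
# -> 'Sunset at the beach #travel '
print(repr(strip_last_hashtag_unrolled("No tags here")))
# -> 'No tags here'
```

Note that the whitespace in front of the dropped hashtag stays in the capture, so a trailing-whitespace trim may be wanted afterwards.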

Upvotes: 1

Cary Swoveland

Reputation: 110685

You could match the following regular expression, which conditions on whether the string ends with a hashtag.

^(?:(?=.*#\w+$).*(?=#\w+$)|.*)


If you need a capture group, use group 0 ($0 in many engines, m.group(0) in Python), which holds the complete match.

The regex engine performs the following operations.

^              : match beginning of string
(?:            : begin non-capture group
  (?=.*#\w+$)  : positive lookahead asserts that the string
                 ends with a hashtag
  .*           : match 0+ characters
  (?=#\w+$)    : positive lookahead asserts that the next character
                 begins a hashtag at the end of the string
|              : or
  .*           : match 0+ characters
)              : end non-capture group

One could alternatively remove the non-capture group and repeat the beginning-of-string anchor:

^(?=.*#\w+$).*(?=#\w+$)|^.*
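
A quick way to check the behaviour in Python, as a sketch only: the helper name and sample strings below are hypothetical, and the whole match is read from group 0 as suggested above.

```python
import re

# First form of the pattern from this answer, unchanged.
ENDS_WITH_TAG = re.compile(r"^(?:(?=.*#\w+$).*(?=#\w+$)|.*)")

def match_before_final_hashtag(text: str) -> str:
    """Return the whole match (group 0); hypothetical helper name."""
    # The second alternative (.*) always succeeds, so match() is never None.
    return ENDS_WITH_TAG.match(text).group(0)

# Hypothetical single-line captions.
print(repr(match_before_final_hashtag("Sunset at the beach #travel #sunset")))
# -> 'Sunset at the beach #travel '
print(repr(match_before_final_hashtag("No tags here")))
# -> 'No tags here'
# A hashtag that is not at the very end is left alone:
print(repr(match_before_final_hashtag("Tag in the #middle only")))
# -> 'Tag in the #middle only'
```

Because `.` does not match a newline by default, a multi-line caption would need `re.DOTALL` (or an inline `(?s)`) for this pattern to see past the first line.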

Upvotes: 1

Scott Buchanan

Reputation: 1223

Just about as soon as I finished typing this out, I found my own solution (yay, rubber-ducking 🐤 it). Figured I'd post it for anybody else needing this specific strange solution:

((^[^#]+$)|(?:.|\n)+)(?(2)|\s#[^#]+)

Test results: https://regex101.com/r/RNGVSL/2/tests
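
Here is roughly how that behaves with Python's `re` module, which supports the `(?(2)...)` conditional. This is a sketch with made-up captions and a hypothetical helper name. The point is that the wanted text is always in group 1, whether or not a hashtag is present.

```python
import re
from typing import Optional

# Pattern from this answer, unchanged. Group 2 only participates when the
# string contains no '#', which is what the (?(2)...) conditional tests.
CONDITIONAL = re.compile(r"((^[^#]+$)|(?:.|\n)+)(?(2)|\s#[^#]+)")

def capture_group_one(text: str) -> Optional[str]:
    """Return group 1, or None when the pattern does not match (hypothetical helper)."""
    m = CONDITIONAL.match(text)
    return m.group(1) if m else None

# Hypothetical captions.
print(repr(capture_group_one("Sunset at the beach #travel #sunset")))
# -> 'Sunset at the beach #travel'
print(repr(capture_group_one("No tags here")))
# -> 'No tags here'
print(repr(capture_group_one("Line one\nLine two #tag")))
# -> 'Line one\nLine two'
```

One edge case to be aware of: a string consisting only of a hashtag, such as "#lonely", does not match at all, because the pattern wants at least one character plus whitespace in front of the final tag.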

Update

Simpler answer from Wiktor Stribiżew in comments:

(?s)^(.*?)(?:\s*#[^\s#]+)?$

Test results: https://regex101.com/r/RNGVSL/3/tests
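
The same kind of check for the simpler pattern, again as a sketch with a hypothetical helper name and made-up captions. Group 1 is the only capturing group, so the group number stays the same in every case.

```python
import re

# Pattern from the update above, unchanged. (?s) lets '.' match newlines.
LAZY = re.compile(r"(?s)^(.*?)(?:\s*#[^\s#]+)?$")

def drop_trailing_hashtag(text: str) -> str:
    """Return group 1: the text without a hashtag at the very end (hypothetical helper)."""
    # The lazy group can grow to cover the whole string, so match() never fails.
    return LAZY.match(text).group(1)

# Hypothetical captions.
print(repr(drop_trailing_hashtag("Sunset at the beach #travel #sunset")))
# -> 'Sunset at the beach #travel'
print(repr(drop_trailing_hashtag("No tags here")))
# -> 'No tags here'
print(repr(drop_trailing_hashtag("Line one\nLine two #tag")))
# -> 'Line one\nLine two'
print(repr(drop_trailing_hashtag("#lonely")))
# -> ''
```

Unlike the conditional version, this one only drops a hashtag that sits at the very end of the string. If other text follows the last hashtag, group 1 is the whole input.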

Upvotes: 1
