Regex not match pattern followed by horizontal ellipsis in string

Question

I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.

The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:

str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]

I can successfully extract a list of hashtags, using the above code, but I want to exclude hashtags that are truncated (i.e. that have a horizontal ellipsis character).

This is frustrating as I have looked everywhere for a solution, and the above code is the best I can come up with, but clearly does not work.

Any help is deeply appreciated.

Wiktor Stribiżew · Accepted Answer

I suggest using regmatches with gregexpr and the Perl-style regex #[^#\s]+(?!…)\b (backslashes are doubled inside an R string literal):

x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl=TRUE)
regmatches(x, m)

See demo on CodingGround
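To apply the same approach to many tweets at once, the call can be wrapped in a small function. This is only a minimal sketch: the name extract_complete_hashtags is hypothetical, it uses the \w+ variant of the pattern, and it writes the ellipsis as the escape \u2026 so the script does not depend on the source file's encoding.

```r
# Hypothetical helper (not part of base R or stringr): returns, for each
# element of `text`, the hashtags that are not cut off by an ellipsis.
extract_complete_hashtags <- function(text) {
  # "\u2026" is the horizontal ellipsis. Regex backslashes are doubled
  # because R string literals consume one level of escaping.
  pattern <- "#\\w+(?!\u2026)\\b"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

tweets <- c("hello #goodbye #au\u2026",
            "#hashtag1 notHashtag #hashtag2 notHashtag #has\u2026")

# One character vector of hashtags per input string.
result <- extract_complete_hashtags(tweets)
print(result)
```

Because gregexpr and regmatches are vectorised, the function returns a list with one element per input string; an input with no complete hashtag yields an empty character vector.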

The regex means:

# - Literal #
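[^#\s]+ - 1 or more characters other than # and whitespace (or \w+ to match alphanumerics and underscore only, or \S+ to match 1 or more non-whitespace characters)
(?!…)\b - A negative lookahead that fails if the next character is …, followed by a word boundary. Together they require the hashtag to end at a word boundary that is not immediately followed by …, so a truncated hashtag is skipped entirely rather than matched in part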

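Result for the string from the question, "hello #goodbye #au…": [1] "#goodbye"

For x as defined above, regmatches(x, m)[[1]] gives "#hashtag1" "#hashtag2".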
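Both the lookahead and the word boundary are needed, and this can be checked on a single truncated hashtag. The sketch below is illustrative only; the helper name matches is made up for this example, and the \w+ variant of the pattern is used.

```r
truncated <- "#has\u2026"   # "\u2026" is the horizontal ellipsis

# Illustration-only helper: all matches of `pattern` in one string.
matches <- function(pattern, s) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

# Word boundary only: "s" followed by the ellipsis is a boundary,
# so the truncated tag is still matched as "#has".
boundary_only <- matches("#\\w+\\b", truncated)

# Lookahead only: the engine backtracks to "#ha", whose next
# character is "s" rather than the ellipsis, so "#ha" is matched.
lookahead_only <- matches("#\\w+(?!\u2026)", truncated)

# Lookahead and boundary together: "#has" fails the lookahead and the
# shorter candidates do not end at a word boundary, so nothing matches.
with_both <- matches("#\\w+(?!\u2026)\\b", truncated)

print(boundary_only)
print(lookahead_only)
print(with_both)
```

Only the combined pattern rejects the truncated hashtag outright instead of returning a fragment of it.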
