timothyjgraham

Reputation: 1232

Regex to avoid matching a pattern followed by a horizontal ellipsis in a string

I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.

The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:

str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]

I can successfully extract a list of hashtags, using the above code, but I want to exclude hashtags that are truncated (i.e. that have a horizontal ellipsis character).

This is frustrating as I have looked everywhere for a solution, and the above code is the best I can come up with, but clearly does not work.

Any help is deeply appreciated.

Upvotes: 1

Views: 980

Answers (1)

Wiktor Stribiżew

Reputation: 627292

I suggest using regmatches with gregexpr and the #[^#\\s]+(?!…)\\b Perl-style regex:

x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl=TRUE)
regmatches(x, m)

See demo on CodingGround

The regex means:

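  • # - Literal #
  • [^#\\s]+ - One or more characters other than # or whitespace (or \\w+ to match alphanumerics and underscore only, or \\S+ to match one or more non-whitespace characters)
  • (?!…)\\b - A negative lookahead that fails the match if the next character is …, followed by a word boundary. Together they make the engine backtrack out of a truncated hashtag until no match is left.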

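Result of the above code execution:

[[1]]
[1] "#hashtag1" "#hashtag2"

Applied to the string from the question, "hello #goodbye #au…", the same pattern returns [1] "#goodbye".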
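
To make the difference between the two approaches concrete, here is a minimal base R sketch that runs the pattern from the question and a lookahead pattern on the question's string. The helper name extract_complete_hashtags is hypothetical (it is not part of any package), and the ellipsis is written as the escape "\u2026" on the assumption that this avoids file-encoding problems.

```r
# Hypothetical helper (not from any package): return only the hashtags
# that are not cut off by a horizontal ellipsis (U+2026).
extract_complete_hashtags <- function(text) {
  # "\u2026" is the ellipsis character, written as an escape so the
  # script does not depend on the encoding of the source file.
  pattern <- "#\\w+(?!\u2026)\\b"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

x <- "hello #goodbye #au\u2026"

# Pattern from the question: the negated class [^...] must consume one
# character, so it swallows the space after "#goodbye" and, after
# backtracking, still accepts the "u" of the truncated "#au".
question_pattern <- "#[[:alnum:]_+]*[^\u2026]"
broken <- regmatches(x, gregexpr(question_pattern, x, perl = TRUE))[[1]]
print(broken)

# Lookahead version: asserts that no ellipsis follows, consuming nothing.
fixed <- extract_complete_hashtags(x)
print(fixed)
```

With these inputs, the question's pattern should return "#goodbye " (with a trailing space) and "#au", while the helper returns only "#goodbye". Since the question uses stringr, the same lookahead pattern will probably also work with str_extract_all(x, "#\\w+(?!…)\\b"), because its regex engine supports lookaheads, but that variant is not tested here.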

Upvotes: 1
