Reputation: 1232
I am trying to extract Twitter hashtags from text in R with a regex, using str_match_all from the "stringr" package.
The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:
str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]
I can successfully extract a list of hashtags, using the above code, but I want to exclude hashtags that are truncated (i.e. that have a horizontal ellipsis character).
This is frustrating, as I have looked everywhere for a solution. The above code is the best I can come up with, but it clearly does not work.
Any help is deeply appreciated.
Upvotes: 1
Views: 980
Reputation: 627292
I suggest using regmatches with gregexpr and the Perl-style regex #[^#\\s]+(?!…)\\b:
x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl = TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl = TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl = TRUE)
regmatches(x, m)
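The two alternatives in the comments are not interchangeable with the first pattern, so here is a small sketch comparing all three on a made-up input. The test string and the names in `variants` are my own, not from the question, and the ellipsis is written as `\u2026` so the code does not depend on the file encoding.

```r
# Hypothetical test string: punctuation after a tag, a hyphenated tag,
# two tags with no space between them, and a truncated tag.
x <- "#one, #two-three #x#y #fou\u2026"

# The three character classes, each followed by the same
# negative lookahead and word-boundary check.
variants <- c(
  not_hash_or_space = "#[^#\\s]+(?!\u2026)\\b",
  word_chars        = "#\\w+(?!\u2026)\\b",
  non_space         = "#\\S+(?!\u2026)\\b"
)

# Run every pattern against the same string; lapply keeps the names.
results <- lapply(variants, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})
print(results)
```

All three drop the truncated tag. They differ elsewhere: `\\w+` stops at the hyphen, and `\\S+` swallows `#x#y` as a single match, so which one to use depends on what should count as a hashtag.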
The regex means:
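# - a literal #
[^#\\s]+ - one or more characters other than # or whitespace (or \\w+ to match alphanumerics and underscores only, or \\S+ to match one or more non-whitespace characters)
(?!…)\\b - a negative lookahead followed by a word boundary: the match must end at a word boundary and must not be immediately followed by …
A truncated tag such as #has… cannot match: stopping right before the … trips the lookahead, and stopping any earlier does not land on a word boundary.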
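Run on the string from the question, "hello #goodbye #au…", this returns [1] "#goodbye". For the sample x above, it returns "#hashtag1" and "#hashtag2".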
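If this needs to run over many tweets, the pattern can be wrapped in a small helper. This is only a sketch: `complete_hashtags` is a name I made up, and it uses the `\\w+` variant. Since the question used stringr, the last lines try the same pattern with `str_extract_all`, but only if stringr is installed; I am assuming its ICU regex engine treats the lookahead and `\\b` the same way here.

```r
# Hypothetical helper: return only the hashtags that are not cut off
# by a trailing horizontal ellipsis (U+2026), one vector per input string.
complete_hashtags <- function(text) {
  pattern <- "#\\w+(?!\u2026)\\b"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

tweets <- c(
  "hello #goodbye #au\u2026",
  "#hashtag1 notHashtag #hashtag2 notHashtag #has\u2026"
)
tags <- complete_hashtags(tweets)
print(tags)

# Optional: the same pattern through stringr, skipped when it is not installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_extract_all(tweets, "#\\w+(?!\u2026)\\b"))
}
```

Because `gregexpr` and `regmatches` are vectorised, the helper returns a list with one character vector per tweet, so it can be applied to a whole column of text at once.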
Upvotes: 1