Reputation: 853
I was wondering if anyone has a quick way to extract hashtags from tweets in R.
For example, given the following string, how can I parse it to extract the word with the hashtag?
string <- 'Crowdsourcing is awesome. #stackoverflow'
Upvotes: 1
Views: 2088
Reputation: 42090
Unlike HTML, I expect you probably can parse hashtags with regex.
library(stringr)
string <- "#hashtag Crowd#sourcing is awesome. #stackoverflow #question"
# I don't use Twitter, so maybe this regex is not right
# for the set of allowable hashtag characters.
# regex() supersedes perl(), which newer stringr versions deprecate.
hashtag.regex <- regex("(?<=^|\\s)#\\S+")
hashtags <- str_extract_all(string, hashtag.regex)
Which yields:
> print(hashtags)
[[1]]
[1] "#hashtag" "#stackoverflow" "#question"
Note that this also works unmodified if string is actually a vector of many tweets: it returns a list of character vectors, one per tweet.
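If you would rather not depend on stringr, here is a rough base R sketch of the same idea applied to a vector of tweets. The names tweets and hashtag_pattern are mine, and I have swapped the lookbehind for the equivalent negative form (?<!\\S), which is only an assumption about what you want to count as a hashtag boundary.

```r
# Sketch: base R only, no stringr required.
tweets <- c("#hashtag Crowd#sourcing is awesome. #stackoverflow #question",
            "no tags in this one")

# (?<!\\S) requires that "#" is NOT preceded by a non-space character,
# i.e. it sits at the start of the string or right after whitespace.
hashtag_pattern <- "(?<!\\S)#\\S+"

# gregexpr() locates every match per tweet; regmatches() pulls them out.
hashtags <- regmatches(tweets, gregexpr(hashtag_pattern, tweets, perl = TRUE))
print(hashtags)
```

The result is a list with one character vector per tweet, and a tweet without hashtags gives an empty vector. As with the regex above, \\S+ grabs everything up to the next space, so trailing punctuation such as "#tag." stays attached.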
Upvotes: 6
Reputation: 18487
Something like this?
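string <- c("Crowdsourcing is awesome. #stackoverflow #answer",
            "another #tag in this tweet")
# Split each tweet at every "#"; the first piece precedes any hashtag.
step1 <- strsplit(string, "#")
# Drop that first piece, keeping only the text that follows each "#".
step2 <- lapply(step1, tail, -1)
# Keep the first space-delimited word of each piece (the tag, without "#").
result <- lapply(step2, function(x) {
  sapply(strsplit(x, " "), head, 1)
})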
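The split-based steps above can be wrapped into a small function, roughly like this. It is only a sketch: extract_tags_split is a made-up name, and putting the "#" back in front and skipping empty pieces are my own additions.

```r
# Hypothetical helper wrapping the split-based approach.
extract_tags_split <- function(tweets) {
  # Split at every "#" and drop the text before the first one.
  pieces <- strsplit(tweets, "#", fixed = TRUE)
  after_hash <- lapply(pieces, tail, -1)

  lapply(after_hash, function(x) {
    # First space-delimited word of each piece; "" when a piece is empty.
    first_words <- vapply(
      strsplit(x, " ", fixed = TRUE),
      function(w) if (length(w) > 0) w[1] else "",
      character(1)
    )
    first_words <- first_words[nzchar(first_words)]
    # Guard: paste0("#", character(0)) would return "#", not an empty vector.
    if (length(first_words) == 0) return(character(0))
    paste0("#", first_words)
  })
}

tags <- extract_tags_split(c("Crowdsourcing is awesome. #stackoverflow #answer",
                             "another #tag in this tweet",
                             "no tags here"))
print(tags)
```

Because this splits on every "#", a string like "Crowd#sourcing" would also produce "#sourcing", so it is less strict than a regex that requires whitespace before the "#".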
Upvotes: 1