notrockstar

Reputation: 853

Parsing tweets to extract hashtags in R

Does anyone have a quick way to extract hashtags from tweets in R? For example, given the following string, how can I parse it to extract the word that carries the hashtag?

string <- 'Crowdsourcing is awesome. #stackoverflow'

Upvotes: 1

Views: 2088

Answers (2)

Ryan C. Thompson

Reputation: 42090

Unlike HTML, I expect you probably can parse hashtags with regex.

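library(stringr)
string <- "#hashtag Crowd#sourcing is awesome. #stackoverflow #question"
# I don't use Twitter, so this regex may not match the exact set of
# allowable hashtag characters.
# The lookbehind requires "#" to start the string or follow whitespace,
# so "Crowd#sourcing" is skipped. perl() was deprecated in stringr 1.0.0;
# a plain pattern string is treated as a regular expression.
hashtag.regex <- "(?<=^|\\s)#\\S+"
hashtags <- str_extract_all(string, hashtag.regex)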

Which yields:

> print(hashtags)
[[1]]
[1] "#hashtag"       "#stackoverflow" "#question"     

Note that this also works unmodified if string is actually a vector of many tweets. It returns a list of character vectors.
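
If you would rather not depend on stringr, the same idea can probably be sketched in base R. This is only an illustration under the same assumption about allowable hashtag characters; the names `tweets` and `hashtag_pattern` are mine, not part of the answer above.

```r
# Base R sketch: same lookbehind pattern, run through PCRE via perl = TRUE.
tweets <- c("#hashtag Crowd#sourcing is awesome. #stackoverflow #question",
            "no tags here")
hashtag_pattern <- "(?<=^|\\s)#\\S+"

# gregexpr() locates every match in each tweet; regmatches() extracts them.
# The result is a list with one character vector per tweet
# (an empty vector when a tweet has no hashtags).
hashtags <- regmatches(tweets, gregexpr(hashtag_pattern, tweets, perl = TRUE))
print(hashtags)
```

Like the stringr version, this skips the "#" inside "Crowd#sourcing" because it is not preceded by whitespace.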

Upvotes: 6

Thierry

Reputation: 18487

Something like this?

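string <- c('Crowdsourcing is awesome. #stackoverflow #answer', 
    "another #tag in this tweet")
# Split each tweet at every "#"; the first piece is the text before any tag.
step1 <- strsplit(string, "#")
# Drop that first piece so only the text following each "#" remains.
step2 <- lapply(step1, tail, -1)
# Keep the first space-delimited word of each remaining piece (the tag itself,
# without the leading "#").
result <- lapply(step2, function(x){
  sapply(strsplit(x, " "), head, 1)
})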
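
To make the behaviour of this split-based approach easier to check, here is a rough sketch that wraps it in a function. The name `extract_tags_split` is hypothetical, and I swapped `sapply` for `vapply` so that a tweet without any "#" still gives back an empty character vector. The third input is there to show a limitation: because the text is split at every "#", a "#" in the middle of a word is treated as a tag too.

```r
# Hypothetical helper wrapping the split-based approach.
extract_tags_split <- function(tweets) {
  pieces <- strsplit(tweets, "#")           # split at every "#"
  after_hash <- lapply(pieces, tail, -1)    # drop the text before the first "#"
  lapply(after_hash, function(x) {
    # First space-delimited word of each piece; vapply keeps the type stable.
    vapply(strsplit(x, " "), function(w) w[1], character(1))
  })
}

result <- extract_tags_split(c(
  "Crowdsourcing is awesome. #stackoverflow #answer",
  "another #tag in this tweet",
  "Crowd#sourcing with no real tag"
))
print(result)
```

The tags come back without the leading "#", and the third tweet yields "sourcing", so a regex with a whitespace check may be the safer choice if mid-word "#" characters can occur.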

Upvotes: 1
