Reputation: 89
I am working with a set of Tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # and put them into separate variables. For example:
This is a test tweet using #twitter. @johnsmith @joesmith.
Ideally I would like it to create new variables in the data frame that contain twitter, johnsmith, joesmith, etc.
Currently I am using:
data$at <- str_match(data$tweet_text, "\\s@\\w+")
data$hash <- str_match(data$tweet_text, "\\s#\\w+")
This only gives me the first occurrence of each in a new variable. Any suggestions?
Upvotes: 0
Views: 488
Reputation: 16277
strsplit and grep will work:
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
grep("#|@", unlist(x), value = TRUE)
#[1] "#twitter." "@johnsmith" "@joesmith."
If you only want to keep the words, without the #, @ or . characters:
out <- grep("#|@", unlist(x), value = TRUE)
gsub("#|@|\\.", "", out)
[1] "twitter" "johnsmith" "joesmith"
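Since the question already uses stringr, the same result can probably be reached without splitting on spaces. This is only a sketch: it assumes the stringr package is installed, and the variable names tweet, hashtags and mentions are made up for illustration. str_extract_all returns every match rather than just the first one, and the lookbehind keeps the # or @ out of the result.

```r
library(stringr)

# Example tweet from the question
tweet <- "This is a test tweet using #twitter. @johnsmith @joesmith."

# str_extract_all returns a list with one character vector per input string;
# (?<=#) and (?<=@) match the position right after the symbol without consuming it
hashtags <- str_extract_all(tweet, "(?<=#)\\w+")[[1]]
mentions <- str_extract_all(tweet, "(?<=@)\\w+")[[1]]

print(hashtags)
print(mentions)
```

Because \\w+ stops at the first non-word character, the trailing periods are dropped without a separate gsub step.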
UPDATE: Putting the results in a list:
my_list <- list()
# First tweet: split into words, then keep the ones containing # or @
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
my_list$hash <- c(my_list$hash, gsub("#|@|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("#|@|\\.", "", grep("@", unlist(x), value = TRUE)))
# Second tweet: append its tags to the same list
x <- strsplit("2nd tweet using #second. @jillsmith @joansmith.", " ")
my_list$hash <- c(my_list$hash, gsub("#|@|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("#|@|\\.", "", grep("@", unlist(x), value = TRUE)))
my_list
$hash
[1] "twitter" "second"
$at
[1] "johnsmith" "joesmith" "jillsmith" "joansmith"
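The list above pools the tags from all tweets together. If the tags need to stay attached to the tweet they came from, as the question's data frame suggests, one possible approach is a list column per symbol. The following is a base R sketch under assumptions: the data frame is rebuilt here from the two example tweets, the column name tweet_text is taken from the question, and extract_tags is a hypothetical helper, not an existing function.

```r
# Example data frame; tweet_text is the column name used in the question
data <- data.frame(
  tweet_text = c(
    "This is a test tweet using #twitter. @johnsmith @joesmith.",
    "2nd tweet using #second. @jillsmith @joansmith."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns, for each tweet, all words that directly
# follow `symbol`, with the symbol itself removed
extract_tags <- function(text, symbol) {
  pattern <- paste0(symbol, "\\w+")
  # gregexpr finds every match per string; regmatches pulls them out as a list
  matches <- regmatches(text, gregexpr(pattern, text))
  lapply(matches, function(m) sub(symbol, "", m, fixed = TRUE))
}

# Each new column is a list with one character vector per row
data$hash <- extract_tags(data$tweet_text, "#")
data$at <- extract_tags(data$tweet_text, "@")

print(data$hash)
print(data$at)
```

A tweet with no hashtags or mentions gets an empty character vector in its row. Note that this pattern would also pick up the part after @ in an email address, so real tweet text may need a stricter pattern.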
Upvotes: 2