dhrice

Reputation: 89

R match expression multiple times in the same line

I am working with a set of Tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # and put them into separate variables. For example:

This is a test tweet using #twitter. @johnsmith @joesmith.

Ideally, it would create new variables in the data frame containing twitter, johnsmith, joesmith, etc.

Currently I am using:

data$at <- str_match(data$tweet_text, "\\s@\\w+")
data$hash <- str_match(data$tweet_text, "\\s#\\w+")

This only puts the first occurrence of each into a new variable. Any suggestions?

Upvotes: 0

Views: 488

Answers (1)

Pierre Lapointe

Reputation: 16277

strsplit and grep will work:

x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
grep("#|@", unlist(x), value = TRUE)
#[1] "#twitter."  "@johnsmith" "@joesmith."

If you only want to keep the words, without the #, @ or .:

out <- grep("#|@", unlist(x), value = TRUE)
gsub("#|@|\\.", "", out)
#[1] "twitter"   "johnsmith" "joesmith"
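As an alternative sketch (not part of the approach above), base R's gregexpr and regmatches can pull out every match in one step, so there is nothing to strip afterwards. This assumes tags consist only of word characters (\w), and the names tweet, hash and at are just illustrative. Note that the @ pattern would also pick up the domain part of an e-mail address, so treat it as a starting point.

```r
tweet <- "This is a test tweet using #twitter. @johnsmith @joesmith."

# gregexpr() locates every match; regmatches() returns the matched text.
# perl = TRUE enables the lookbehind, so the leading # or @ is not captured.
hash <- regmatches(tweet, gregexpr("(?<=#)\\w+", tweet, perl = TRUE))[[1]]
at   <- regmatches(tweet, gregexpr("(?<=@)\\w+", tweet, perl = TRUE))[[1]]

hash
#[1] "twitter"
at
#[1] "johnsmith" "joesmith"
```

Because \w+ stops at the first non-word character, the trailing period is dropped without a separate gsub call.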

UPDATE: Putting the results in a list:

# Start empty; the first `$<-` assignment turns this into a list
my_list <- NULL

# First tweet: split into words, keep those with # or @, strip the symbols
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
my_list$hash <- c(my_list$hash, gsub("#|@|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("#|@|\\.", "", grep("@", unlist(x), value = TRUE)))

# Second tweet: append to the same list elements
x <- strsplit("2nd tweet using #second. @jillsmith @joansmith.", " ")
my_list$hash <- c(my_list$hash, gsub("#|@|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("#|@|\\.", "", grep("@", unlist(x), value = TRUE)))

my_list
$hash
[1] "twitter" "second" 

$at
[1] "johnsmith" "joesmith"  "jillsmith" "joansmith"
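Since the question asks for new variables in the data frame, here is a rough sketch of how the same idea might be applied to a whole column at once. The data frame below is a made-up stand-in for the asker's data, and extract_tags is a hypothetical helper name, not an existing function. Each new column is a list column, because a tweet can contain any number of tags.

```r
# Hypothetical data frame standing in for the asker's `data`
data <- data.frame(
  tweet_text = c(
    "This is a test tweet using #twitter. @johnsmith @joesmith.",
    "2nd tweet using #second. @jillsmith @joansmith."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each string, return every run of word
# characters that directly follows `prefix` (one list element per string)
extract_tags <- function(text, prefix) {
  pattern <- paste0("(?<=", prefix, ")\\w+")
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

# One list element per row, so each assignment creates a list column
data$hash <- extract_tags(data$tweet_text, "#")
data$at   <- extract_tags(data$tweet_text, "@")

data$at[[1]]
#[1] "johnsmith" "joesmith"
```

If a plain character column is preferred, the elements could be collapsed with something like sapply(data$at, paste, collapse = " ").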

Upvotes: 2
