Reputation: 80
I would like to extract only the hashtags from tweets with gsub. For example:
sentence = tweet_text$text
And the result is:
"The #Sun #Halo is out in full force today People need to look up once in awhile to see",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"
What I am trying to get is only #Sun and #Halo from the first one; #YouthStrikeClimate, #Friday... and the rest from the second one; and #storm from the last one.
I tried to do this with:
sentence = gsub("^(?!#)","",sentence,perl = TRUE)
or
sentence1 = gsub("[^#\\w+]","",sentence,perl = TRUE)
or something similar. I have already removed unneeded tokens such as numbers and http:// links.
How can I extract the hashtags using gsub?
Upvotes: 1
Views: 599
Reputation: 887891
In base R, we can use regmatches/gregexpr:
regmatches(x, gregexpr("#\\S+", x))
#[[1]]
#[1] "#Sun" "#Halo"
#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture" "#ClimateChange"
#[[3]]
#[1] "#storm"
As for using gsub, either
trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))
or
trimws(gsub("(^| )[A-Za-z]+\\b", "", x))
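would keep only the words that start with #, separated by a single space. Note that the second pattern removes only purely alphabetic words, so it relies on numbers and links having been stripped already.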
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"
)
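If one string of hashtags per tweet is wanted (closer to what gsub would return), the list from regmatches can be collapsed. The following is only a minimal sketch: the names x_demo, tags and tag_string are made up for illustration, and the third sample tweet is an invented one with no hashtags.

```r
# Hypothetical sample data; the last tweet deliberately has no hashtag
x_demo <- c("The #Sun #Halo is out in full force today",
            "inspired #YouthStrikeClimate #FridayForFuture",
            "no tags here")

# One character vector of hashtags per tweet (character(0) when none are found)
tags <- regmatches(x_demo, gregexpr("#\\S+", x_demo))

# Collapse each vector into a single space-separated string;
# tweets without hashtags become the empty string ""
tag_string <- vapply(tags, paste, character(1), collapse = " ")
tag_string
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one element per tweet.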
Upvotes: 0
Reputation: 389275
We could use str_extract_all from stringr to extract all the words that are preceded by a hash (#).
stringr::str_extract_all(x, '#\\w+')
#[[1]]
#[1] "#Sun" "#Halo"
#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture" "#ClimateChange"
#[[3]]
#[1] "#storm"
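The patterns #\\S+ and #\\w+ give the same result on the sample data, but they can differ when a hashtag is directly followed by punctuation. A small sketch in base R, using an invented sentence (y, with_S and with_w are illustrative names only):

```r
# Hypothetical tweet where hashtags are followed by punctuation
y <- "Heavy snow from the latest #storm, stay safe #winter!"

# \\S+ takes every non-whitespace character, so punctuation is kept
with_S <- regmatches(y, gregexpr("#\\S+", y))[[1]]

# \\w+ takes only letters, digits and underscore, so punctuation is dropped
with_w <- regmatches(y, gregexpr("#\\w+", y))[[1]]

with_S
with_w
```

If punctuation has already been removed from the tweets, either pattern should behave the same way.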
A base R approach with minimal regex: we split each string on whitespace and keep only the words that start with # (using startsWith).
sapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])
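One thing to keep in mind with sapply is that it simplifies its result when it can. On the sample data the tweets have different numbers of hashtags, so a list comes back, but if every tweet happens to have exactly one hashtag the result is a plain character vector. A sketch with invented data (z, s and l are illustrative names), showing lapply as an alternative that always returns a list:

```r
# Hypothetical data: every tweet has exactly one hashtag
z <- c("one #a here", "two #b there")

keep_tags <- function(p) p[startsWith(p, "#")]

# sapply simplifies equal-length results to a character vector
s <- sapply(strsplit(z, "\\s+"), keep_tags)

# lapply always returns a list, one element per tweet
l <- lapply(strsplit(z, "\\s+"), keep_tags)

is.list(s)
is.list(l)
```

Which form is preferable depends on what the downstream code expects.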
data
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro")
Upvotes: 2