Oasis

Reputation: 80

How to extract hashtags with gsub

I would like to extract only the hashtags from tweets with gsub. For example:

sentence = tweet_text$text

And the result is:

"The #Sun #Halo is out in full force today People need to look up once in awhile to see",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"

What I am trying to get is only #Sun and #Halo from the first one, #YouthStrikeClimate, #Friday~~ from the second one, and #storm from the last one.

I tried to do this with:

sentence = gsub("^(?!#)","",sentence,perl = TRUE)

or

sentence1 = gsub("[^#\\w+]","",sentence,perl = TRUE)

and similar variants. I have already removed noise such as numbers and http:// links.

How can I extract the hashtags using gsub?

Upvotes: 1

Views: 599

Answers (2)

akrun

Reputation: 887891

In base R, we can use regmatches/gregexpr

regmatches(x, gregexpr("#\\S+", x))
#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture"    "#FridaysFuture"      "#ClimateChange"     

#[[3]]
#[1] "#storm"

As for using gsub, either

trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

or

trimws(gsub("(^| )[A-Za-z]+\\b", "", x))

would remove every word that does not start with #, leaving only the hashtags of each tweet in a single string, separated by spaces.
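Since gsub returns one string per tweet rather than a list of hashtags, a further split is needed to get the same shape as the regmatches result. A minimal sketch of that extra step, assuming the same x as in the data section below (the names kept and tags are only illustrative):

```r
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
       "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
       "Multiple warnings in effect for snow and wind with the latest #storm       Metro")

# Drop every word that is not directly preceded by '#', then trim the ends.
kept <- trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

# Split what is left on runs of whitespace: one character vector per tweet.
tags <- strsplit(kept, "\\s+")
print(tags)
```

If a list of hashtags is the goal, the regmatches/gregexpr call above gets there in one step; the gsub route is mostly useful when a cleaned-up string is wanted.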

data

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
 "Multiple warnings in effect for snow and wind with the latest #storm       Metro"
 )

Upvotes: 0

Ronak Shah

Reputation: 389275

We could use str_extract_all from stringr and extract all the words that are preceded by a hash (#).

stringr::str_extract_all(x, '#\\w+')

#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture"  "#ClimateChange"

#[[3]]
#[1] "#storm"
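The two patterns used on this page, "#\\S+" and "#\\w+", give the same result on the sample tweets but can differ when a hashtag is followed by punctuation. A small base R sketch of the difference, using a made-up input string y that is not part of the question:

```r
# Hypothetical input (not from the question): hashtags followed by punctuation.
y <- "Big #storm, stay safe! #Metro."

# "#\\S+" runs up to the next whitespace, so trailing punctuation is included.
greedy <- regmatches(y, gregexpr("#\\S+", y))[[1]]

# "#\\w+" stops at the first character that is not a letter, digit or underscore.
strict <- regmatches(y, gregexpr("#\\w+", y))[[1]]

print(greedy)  # "#storm," "#Metro."
print(strict)  # "#storm"  "#Metro"
```

Which one is preferable depends on the data; if punctuation has already been stripped from the tweets, the two should behave alike.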

A base R approach with minimal regex: we split each string on whitespace and keep only the words that start with #.

sapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])
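One thing to keep in mind is that sapply simplifies its result when every element has the same length, so the return type can change with the input. A hedged variant using lapply, which always returns a list (the vectors x and z below are illustrative; z is made up):

```r
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
       "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
       "Multiple warnings in effect for snow and wind with the latest #storm  Metro")

# Helper name is hypothetical; it keeps the words of one tweet that start with '#'.
keep_tags <- function(p) p[startsWith(p, "#")]

# lapply always returns a list with one element per tweet.
tags <- lapply(strsplit(x, "\\s+"), keep_tags)

# With exactly one hashtag per tweet, sapply collapses to a character vector.
z <- c("a #b", "#c d")
simplified <- sapply(strsplit(z, "\\s+"), keep_tags)

print(tags)
print(simplified)
```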

data

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
  "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
  "Multiple warnings in effect for snow and wind with the latest #storm  Metro")

Upvotes: 2
