Rafs

Reputation: 814

R tm package's `removeWords` not removing twitter hashtags from tweets due to #

I am trying to remove hashtags from tweets using tm's function removeWords. The hashtags start with # as you know, and I want to remove these tags in their entirety. However, removeWords doesn't remove them:

> library(tm)
> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("#Ht", "https://google.com"))

[1] "WOW it is cool! #Ht "

If I remove the # from the words argument, the tag is removed:

> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("Ht", "https://google.com"))
[1] "WOW it is cool! # "

This leaves the orphaned # behind.

Why is this happening? Shouldn't the function simply remove the words exactly as given, or am I missing something? The manual is not very helpful here.

Upvotes: 0

Views: 408

Answers (4)

user13653858

Reputation:

What a nice question! It's a bit tricky: when you look at the source code of tm::removeWords(), you'll see what it does:

gsub(sprintf("(*UCP)\\b(%s)\\b",
             paste(sort(words, decreasing = TRUE), collapse = "|")),
     "", x, perl = TRUE)

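The pattern wraps the words in \b word-boundary anchors, as @Dason mentions. Since # is not a word character, there is no boundary between a preceding space and #, so "#Ht" never matches. That's why it's so tricky to remove hashtags. But you can use the source as inspiration to build your own function: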

# some tweets
tweets <- rep("WOW it is cool! #Ht https://google.com", times = 1e5)
remove <- c("#Ht", "https://google.com")

# Unlike removeWords(), accept whitespace or the start of the string
# (not only a word boundary) on the left-hand side of each word.
# Note: entries in `words` are still interpreted as regular expressions.
removeWords2 <- function(x, words) {
  pattern <- sprintf("(\\b|\\s|^)(%s)\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|"))
  gsub(pattern, "", x)
}

# remove words
data <- removeWords2(tweets, remove)

# check that
head(data)
#> [1] "WOW it is cool!" "WOW it is cool!" "WOW it is cool!" "WOW it is cool!"
#> [5] "WOW it is cool!" "WOW it is cool!"

Created on 2020-07-17 by the reprex package (v0.3.0)

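It's pretty fast and gives the expected result on this example, and you can adjust it to your own needs. Keep in mind that the pattern also consumes the whitespace in front of each removed word, and that it still requires a word boundary on the right, so a word ending in a non-word character (such as "cool!") is not matched when it is followed by a space or the end of the string.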
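One possible adjustment, as a sketch rather than a definitive implementation: treat every entry as a literal, whitespace-delimited token. The helper names escape_regex and remove_tokens are hypothetical (they are not part of tm), and the lookarounds (?<!\S) and (?!\S) need perl = TRUE. The assumption is that tokens in the tweets are separated by whitespace.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that each
# word is matched literally (e.g. the "." in a URL no longer matches any character).
escape_regex <- function(s) {
  gsub("([\\\\.^$|()\\[\\]{}*+?])", "\\\\\\1", s, perl = TRUE)
}

# Hypothetical helper: remove whole whitespace-delimited tokens.
# (?<!\S) = not preceded by a non-space character (start of string or whitespace)
# (?!\S)  = not followed by a non-space character (end of string or whitespace)
remove_tokens <- function(x, words) {
  # Longest words first, so the alternation tries longer entries before shorter ones
  words <- words[order(nchar(words), decreasing = TRUE)]
  pattern <- sprintf("(?<!\\S)(%s)(?!\\S)",
                     paste(escape_regex(words), collapse = "|"))
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the whitespace left behind and trim both ends
  trimws(gsub("\\s+", " ", out))
}

remove_tokens("WOW it is cool! #Ht https://google.com",
              c("#Ht", "https://google.com"))
#> [1] "WOW it is cool!"
```

Because both sides are checked with lookarounds, "#Ht" is not removed from inside a longer tag such as "#Hts". The trade-off is that a tag directly followed by punctuation (for example "#Ht,") is left alone as well.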

Upvotes: 0

phiver

Reputation: 23608

You could use functions from the textclean package to help you with this.

library(textclean)
txt <- "WOW it is cool! #Ht https://google.com"

# remove twitter hashtags
txt <- replace_hash(txt)
# remove urls
txt <- replace_url(txt)

txt
[1] "WOW it is cool!  "

To incorporate this inside tm, wrap the functions in content_transformer and call them with tm_map:

...
# after creating corpus
my_corpus <- tm_map(my_corpus, content_transformer(replace_hash))
my_corpus <- tm_map(my_corpus, content_transformer(replace_url))
....
# rest of code

Upvotes: 1

Julian_Hn

Reputation: 2141

An alternative that uses stringr instead of tm:

library(stringr)

replaceHashtags <- function(str,tags)
{
  repl <- rep("",length(tags))
  names(repl) <- tags
  return(stringr::str_replace_all(str, repl))
}

ExStr <- "WOW it is cool! #Ht #tag2 https://google.com"
Extags <- c("#Ht","#tag2")
replaceHashtags(ExStr,Extags)

[1] "WOW it is cool!   https://google.com"

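This removes every occurrence of the hashtags specified in tags. str_replace_all is vectorised over its string argument, so you can pass a whole character vector of tweets as str without sapply. Note that the tags are matched as regular expressions without any boundary, so "#Ht" would also be removed from inside a longer tag such as "#Hts".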

Upvotes: 0

Dason

Reputation: 61973

Unfortunately I can't think of a great way around it. The reason for what you're seeing is that removeWords builds a regular expression that wraps the words in word boundaries (\b). A word boundary only exists between a word character and a non-word character; "#" and the space before it are both non-word characters, so there is no boundary in front of "#Ht" and the pattern never matches. I hope to see a better answer with a nice workaround, but you might just need to do something simple like an initial pass where you replace "#" with some keyword, and then use that keyword in place of the "#" when creating your list of words to remove.
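A minimal sketch of both points, under some assumptions: remove_words below is a stand-in that mirrors the gsub() call quoted from tm::removeWords in another answer, so the example runs without tm installed, and the marker string HASHTAGMARKER is a made-up placeholder that must not occur in the real text.

```r
x <- "WOW it is cool! #Ht"

# No word boundary between " " and "#", so the pattern cannot match
has_match_with_hash <- grepl("\\b#Ht\\b", x, perl = TRUE)
has_match_with_hash
#> [1] FALSE

# There is a boundary between "#" and "H", so this one matches
has_match_without_hash <- grepl("\\bHt\\b", x, perl = TRUE)
has_match_without_hash
#> [1] TRUE

# Stand-in for tm::removeWords (assumption: same pattern as the quoted source)
remove_words <- function(x, words) {
  gsub(sprintf("(*UCP)\\b(%s)\\b",
               paste(sort(words, decreasing = TRUE), collapse = "|")),
       "", x, perl = TRUE)
}

marker <- "HASHTAGMARKER"  # hypothetical placeholder keyword
tweet  <- "WOW it is cool! #Ht https://google.com"
words  <- c("#Ht", "https://google.com")

# Pass 1: swap "#" for the marker in both the text and the removal list
tweet_marked <- gsub("#", marker, tweet, fixed = TRUE)
words_marked <- gsub("#", marker, words, fixed = TRUE)

# Pass 2: the marker starts with a word character, so \b now matches before it
cleaned <- remove_words(tweet_marked, words_marked)

# Pass 3: restore "#" on hashtags that were not in the removal list
cleaned <- gsub(marker, "#", cleaned, fixed = TRUE)

trimws(cleaned)
#> [1] "WOW it is cool!"
```

With tm loaded, remove_words could be swapped for removeWords itself. The whitespace around the removed words stays in place, hence the trimws() at the end.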

Upvotes: 0
