Pablo Tapia Varela

Reputation: 49

Creating a function to remove only specific word in a list (R)

I have a list of undesirable Spanish words (stop words) which are meaningless on their own, but which also appear inside other words. I want to remove them only when they stand alone as a term, not when they are a piece of another word.

For example, "la" is a Spanish article, but if I use a function to remove it, it will also break a useful term like "relacion" (which means relationship) into two pieces.
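A minimal reproduction of the issue with a plain substring replacement:

```r
# removing "la" as a raw substring also hits the "la" inside "relacion"
gsub("la", "", "la relacion era buena")
# [1] " recion era buena"
```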

My first choice was to create a function to remove these terms.

# bdtidy$tweet contains the tweets
fix.useless <- function(doc) {
  doc <- gsub("la", ".", doc)
  doc <- gsub("las", ".", doc)
  doc <- gsub("el", ".", doc)
  doc <- gsub("ellos", ".", doc)
  doc <- gsub("ellas", ".", doc)
  return(doc)
}

bdtidy$tweet <- sapply(bdtidy$tweet, fix.useless)

My second choice was a vector of stop words, then using filter on the data frame.

nousar <- c("rt", "pero", "para", ...)
new_df <- bdtidy %>%
  filter(!tweet %in% nousar)

But in both cases the result removes those strings everywhere, breaking longer terms into pieces, which makes my analysis useless. Thanks.

Upvotes: 0

Views: 2172

Answers (2)

xilliam

Reputation: 2259

One way to remove standalone words from a string is to flank the target word with spaces, as in this example:

# sample input
x <- c("Get rid of la but not lala")
# pattern with spaces flanking target word
y <- gsub(" la ", " ", x)
# output
> y
[1] "Get rid of but not lala"
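Note that the space-flanking pattern misses the word when it sits at the start or end of the string. A word-boundary pattern (`\b`, supported by R's default regex engine) avoids that; a small sketch:

```r
# "la" appears at the start, mid-string, and inside other words
x <- "la casa y la relacion, no lala"
# \b matches a word boundary, so only standalone "la" is removed
y <- gsub("\\bla\\b", "", x)
y
# [1] " casa y  relacion, no lala"
# collapse the leftover double spaces if needed
trimws(gsub("\\s+", " ", y))
# [1] "casa y relacion, no lala"
```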

Upvotes: 2

struggles

Reputation: 865

You can tokenize the text, that is, extract the individual words. Once they are extracted, you can check each token against your stop words and remove the matches. The stringr package can help you here:

#sample text
text <- "hola, me llamo struggles. El package 'stringr' puede resolver la pregunta."

#normalize text by making everything lowercase
lower_text <- stringr::str_to_lower(text)

#split text at anything that isn't a number or a letter
tokens <- stringr::str_split(lower_text, "[^[:alnum:]]+")

#create a list of stop words
stop_words <- c('la', 'las', 'el', 'ellos')

#remove words that are in the stop words vector
tokens[[1]][!tokens[[1]] %in% stop_words]

Since you'll probably be doing this with a lot of tweets, I suggest you also take a look at the tidytext package and read through the tutorial at https://www.tidytextmining.com/

df <- data.frame(
  tweet = text,
  tweet_id = 1234,
  user = 'struggles',
  stringsAsFactors = F
)

twitter_tokens <- tidytext::unnest_tokens(df, word, tweet)

clean_twitter_tokens <- dplyr::filter(twitter_tokens, !word %in% stop_words)

and this will give you something like

  tweet_id      user      word
1     1234 struggles      hola
2     1234 struggles        me
3     1234 struggles     llamo
4     1234 struggles struggles
5     1234 struggles   package
6     1234 struggles   stringr
7     1234 struggles     puede
8     1234 struggles  resolver
9     1234 struggles  pregunta

And if you want to keep it together in one sentence then the following will bring it back:

clean_twitter_tokens %>%
  dplyr::group_by(tweet_id, user) %>%
  dplyr::summarize(tweet = stringr::str_c(word, collapse = ' '))

giving you

  tweet_id user      tweet                                                          
     <dbl> <chr>     <chr>                                                          
1     1234 struggles hola me llamo struggles package stringr puede resolver pregunta

Upvotes: 0
