Reputation: 49
I have a list of undesirable words (in Spanish) which are meaningless on their own, but which also appear inside other words. I want to remove them only when they stand alone as a term, not when they are part of another word.
For example, "la" is a Spanish article, but if I use a function to remove it, it will also break a useful term like "relacion" (which means relationship) into two words.
My first attempt was a function to remove these terms:
# bdtidy$tweet contains the tweets
fix.useless <- function(doc) {
doc <- gsub("la", ".", doc)
doc <- gsub("las", ".", doc)
doc <- gsub("el", ".", doc)
doc <- gsub("ellos", ".", doc)
doc <- gsub("ellas", ".", doc)
return(doc)
}
bdtidy$tweet <- sapply(bdtidy$tweet, fix.useless)
My second attempt was a vector of words, then filtering the data frame:
nousar <- c("rt", "pero", "para"...)
new_df <- bdtidy %>%
filter(!tweet %in% nousar)
But the result is always that every occurrence of those words is removed, breaking longer terms in two, which makes my analysis useless. Thanks.
Upvotes: 0
Views: 2172
Reputation: 2259
One way to remove single words from a string is to flank the target word with spaces, as in this example:
# sample input
x <- c("Get rid of la but not lala")
# pattern with spaces flanking target word
y <- gsub(" la ", " ", x)
# output
> y
[1] "Get rid of but not lala"
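A limitation of flanking with literal spaces is that the word is missed at the start or end of the string, or next to punctuation. A sketch using regex word boundaries (`\\b`) instead, which base R's `gsub` supports:

```r
# "\\b" matches a word boundary, so "la" is removed even at the
# start/end of the string or beside punctuation, while "lala" and
# "relacion" are left intact because the pattern cannot match inside
# a longer word.
x <- c("la casa", "Get rid of la but not lala")
y <- gsub("\\bla\\b", "", x, perl = TRUE)
# collapse the leftover double spaces and trim the edges
y <- trimws(gsub("\\s+", " ", y))
```

This is only a sketch; for a long stop word list you would build the pattern programmatically, e.g. `paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")`.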
Upvotes: 2
Reputation: 865
You can tokenize the words, that is, extract the individual words. Once they are extracted, you can check the tokens for matches and remove them. The stringr package can help you here:
#sample text
text <- "hola, me llamo struggles. El package 'stringr' puede resolver la pregunta."
#normalize text by making everything lowercase
lower_text <- stringr::str_to_lower(text)
#split text at anything that isn't a number or a letter
tokens <- stringr::str_split(lower_text, "[^[:alnum:]]+")
#create a list of stop words
stop_words <- c('la', 'las', 'el', 'ellos')
#remove words that are in the stop words vector
tokens[[1]][!tokens[[1]] %in% stop_words]
Since you'll probably be doing this with a lot of tweets, I suggest you also take a look at the tidytext package and read through the tutorial at https://www.tidytextmining.com/
df <- data.frame(
tweet = text,
tweet_id = 1234,
user = 'struggles',
stringsAsFactors = F
)
twitter_tokens <- tidytext::unnest_tokens(df, word, tweet)
clean_twitter_tokens <- dplyr::filter(twitter_tokens, !word %in% stop_words)
and this will give you something like
tweet_id user word
1 1234 struggles hola
2 1234 struggles me
3 1234 struggles llamo
4 1234 struggles struggles
5 1234 struggles package
6 1234 struggles stringr
7 1234 struggles puede
8 1234 struggles resolver
9 1234 struggles pregunta
And if you want to keep it together in one sentence then the following will bring it back:
clean_twitter_tokens %>%
dplyr::group_by(tweet_id, user) %>%
dplyr::summarize(tweet = stringr::str_c(word, collapse = ' '))
giving you
tweet_id user tweet
<dbl> <chr> <chr>
1 1234 struggles hola me llamo struggles package stringr puede resolver pregunta
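As a side note (this is an assumption, not part of the answer above): rather than maintaining the stop word vector by hand, the stopwords package ships prebuilt lists for many languages, including Spanish, which you can drop into the same filter:

```r
# Assumes the 'stopwords' package is installed (install.packages("stopwords")).
# stopwords("es") returns a character vector of common Spanish function
# words such as "la", "el", "las", which can replace a hand-written list.
stop_words_es <- stopwords::stopwords("es")
"la" %in% stop_words_es
```

The hand-written vector is still useful for domain-specific noise like "rt", so in practice you might combine both with `c(stop_words_es, "rt")`.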
Upvotes: 0