Reputation: 87
I am trying to use R to find the Spanish words in a set of words. I have all the Spanish words in an Excel file that I don't know how to attach to this post (it has more than 80,000 words), and I want to check whether certain words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me this error:
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
As you can see, this option only gives TRUE for exact matches, and I need "Sillas" to also come out as TRUE (it is the plural of "silla"). That is why I tried grepl first, to catch plurals as well.
Any idea?
Upvotes: 0
Views: 545
Reputation: 151
As a data frame:
library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_detect()

df <- tibble(text = c("some words",
                      "more words",
                      "Perro",
                      "And asdfg",
                      "Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
  mutate(
    text = tolower(text),
    spanish_word = str_detect(text, words)
  )
Returns:
  text                 spanish_word
  <chr>                <lgl>
1 some words           FALSE
2 more words           FALSE
3 perro                TRUE
4 and asdfg            TRUE
5 comb perro and asdfg TRUE
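For the original case (a dictionary of 80,000+ words, where one huge regex alternation is impractical), a possible alternative is to keep the exact %in% lookup and strip a plural ending before checking. This is only a rough sketch: the three-word spanish_words below is a made-up stand-in for the real list, is_spanish is a hypothetical helper name, and the "drop a trailing s or es" rule is a naive assumption, not a real Spanish stemmer.

```r
# Hypothetical stand-in for the full 80,000-word dictionary
spanish_words <- c("silla", "perro", "flor")
words <- c("Silla", "Sillas", "Perro", "Flores", "asdfg")

# A word counts as Spanish if it, or it minus a trailing "s" or "es",
# is found in the dictionary (case-insensitive, exact lookup).
is_spanish <- function(words, dictionary) {
  w <- tolower(words)
  d <- tolower(dictionary)
  w %in% d | sub("s$", "", w) %in% d | sub("es$", "", w) %in% d
}

result <- is_spanish(words, spanish_words)
```

Because %in% is a table lookup and not a pattern match, it should scale to a large dictionary without building a giant pattern. Irregular plurals (for example "lápices" from "lápiz") would need extra rules.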
Upvotes: 1