Reputation: 60
I want to perform text classification with many (>50K) tokens as feature names. However, the Task() functions in mlr3 reject many characters in column names, even though those names pass make.names() and are otherwise fine. Here is a list of the ones I have found so far:
mutate(token=str_replace(token, "à", "a")) %>%
mutate(token=str_replace(token, "ã", "a")) %>%
mutate(token=str_replace(token, "á", "a")) %>%
mutate(token=str_replace(token, "ø", "o")) %>%
mutate(token=str_replace(token, "ç", "c")) %>%
mutate(token=str_replace(token, "ô", "o")) %>%
mutate(token=str_replace(token, "é", "e")) %>%
mutate(token=str_replace(token, "é", "e")) %>%
mutate(token=str_replace(token, "í", "i")) %>%
mutate(token=str_replace(token, "î", "i")) %>%
mutate(token=str_replace(token, "è", "e")) %>%
mutate(token=str_replace(token, "ë", "e")) %>%
mutate(token=str_replace(token, "å", "a")) %>%
mutate(token=str_replace(token, "â", "a")) %>%
mutate(token=str_replace(token, "æ", "a")) %>%
mutate(token=str_replace(token, "ñ", "n")) %>%
How do I make my data.frame compatible with mlr3 without manually replacing every special character this way (trial and error)? make.names() obviously does not work!
I would very much appreciate some help :) Thanks!
Upvotes: 2
Views: 509
Reputation: 655
Using the janitor package is one option. Base R also comes with the (less sophisticated) function make.names(names, unique = TRUE), which also works fine.
If you really need to keep the original names, you can set the experimental option "mlr3.allow_utf8_names" to TRUE, but be aware that this might break some learners.
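Setting that option might look like the following. This is a minimal sketch that assumes the option name quoted above and that it is set before the task is created; since the option is described as experimental, its behaviour may differ between mlr3 versions.

```r
# Relax mlr3's feature-name check so non-ASCII column names are accepted.
# Assumption: this runs before any Task is constructed from the data.
options(mlr3.allow_utf8_names = TRUE)

# Confirm the option is active in the current session.
getOption("mlr3.allow_utf8_names")
```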
Upvotes: 0
Reputation: 10192
One way to do it is to use janitor::clean_names():
d <- data.frame(`süßigkeit` = 1:3, `straße` = 1:3, `Hellö` = 1:3, `séé` = 1:3)
janitor::clean_names(d)
#> sussigkeit strasse hello see
#> 1 1 1 1 1
#> 2 2 2 2 2
#> 3 3 3 3 3
Created on 2021-01-11 by the reprex package (v0.3.0)
If you're processing a vector rather than the names of a data.frame, you could use the underlying function janitor::make_clean_names():
janitor::make_clean_names("süßigkeit")
#> [1] "sussigkeit"
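For the token column in the question, the chain of str_replace() calls could probably be replaced by a single transliteration step. The sketch below uses stringi::stri_trans_general() with the "Latin-ASCII" transform; the `tokens` vector is a hypothetical stand-in for the real column, and the approach assumes the stringi package is installed.

```r
# Hypothetical token vector standing in for the question's `token` column.
tokens <- c("à", "ø", "ç", "é", "ñ", "see", "séé")

# Transliterate accented Latin characters to plain ASCII in one pass,
# instead of one str_replace() call per character.
ascii <- stringi::stri_trans_general(tokens, "Latin-ASCII")

# Different tokens can collapse to the same ASCII string (here "see" and
# "séé"), so make the result syntactically valid and unique.
feature_names <- make.names(ascii, unique = TRUE)

feature_names
```

Because transliteration can merge distinct tokens, it may be worth keeping a lookup table from the original tokens to the cleaned names so that model results can be mapped back afterwards.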
Upvotes: 6