Reputation: 59
I am struggling to collapse the many levels of one specific factor variable into just three levels in RStudio.
My starting point is a data.table with 250 variables and roughly 4,000 rows. For one factor variable I want to collapse its 75 levels into 3 levels. Moreover, 4 of the 75 levels should be ignored (or set to NA beforehand) since they contain controversial information. This factor variable is based on survey answers that also include individual free-text answers. Sometimes even the language differs. So, it's a bit messy.
I tried to collapse these 75 levels (or 71 levels if the respective observations are set to NA first) into 3 in two different ways. However, R always returns a + instead of a > in the console and I can't run any other commands. Of course I can stop this by hitting Esc, but that does not get me the result I want.
So, this imaginary example should show what I tried:
1) using the levels and list functions
levels(dt$x) <- list("No"=c("I don't allow anything", "..."),
"Yes"= c("Number of visitors ,annual sales, sales growth, number of customers", "Net sales", "..."),
"Maybe"=c("The CEO's approval is needed.", "To be discussed"))
2) using the forcats package
dt$x %>%
fct_collapse(No= c("I don't allow anything", "..."),
Yes= c("Number of visitors ,annual sales, sales growth", "number of customers", "Net sales", "..."),
Maybe=c("The CEO's approval is needed.", "To be discussed"))
I assume the problem arises due to how the original variable is structured. Does anyone have an idea how I could address that?
A big thank you upfront!
Best, Ilka
Upvotes: 0
Views: 962
Reputation: 3883
I've written a function, xfactor, on GitHub to help with exactly this kind of situation. It allows recoding of factor levels with regex matching, so it can be useful for working with messy data. It also lets you drop factor levels by regex matching using the 'exclude' argument.
devtools::install_github("jwilliman/xfactor")
library(xfactor)
dt$x <- xfactor::xfactor(dt$x, levels = c(
  No = "don't|never",
  Yes = "sales|visitors|customers",
  Maybe = "approval|discuss"),
  exclude = "irrelevant", ignore.case = TRUE
)
See https://stackoverflow.com/a/37800944/4241780 for further examples.
By the way, if R is returning a + instead of a > in the console, you have probably missed a closing ) or " somewhere!
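If it helps to track down where the input is left open, here is a minimal sketch that checks whether a snippet parses as complete R code before you paste it into the console. The helper name is_complete is made up for this illustration; it only relies on base R's parse(), which raises an error on unbalanced parentheses or unterminated strings.

```r
# Hypothetical helper (not part of any package): TRUE if `code` parses as
# complete R code, FALSE if parse() raises an error (e.g. missing ) or ").
is_complete <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

is_complete('c("a", "b")')  # balanced parentheses and quotes
is_complete('c("a", "b"')   # closing ) is missing
is_complete('c("a", "b)')   # closing " is missing
```

Running the full recoding statement through a check like this (or simply sourcing it from a script file) tends to give an error message instead of a silent + prompt.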
Upvotes: 0
Reputation: 59
A friend of mine actually provided the answer. It's nothing to do with the data structure.
This does the job:
dt$x <- fct_collapse(dt$x,
No = c(
"I don't allow anything",
"..."),
Yes= c(
"Number of visitors ,annual sales, sales growth",
"number of customers",
"Net sales",
"..."),
Maybe= c(
"The CEO's approval is needed.",
"To be discussed")
)
I still don't know why the first option I posted above doesn't work though (it did perfectly well with another variable).
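For reference, a minimal sketch of the levels() and list() approach on a toy factor. The values below are invented for illustration and are not the survey data. In base R, old levels that are not listed in any of the vectors become NA, which would also cover the levels that should be ignored.

```r
# Toy factor standing in for dt$x; the answers are made up for illustration.
x <- factor(c("never", "Net sales", "To be discussed", "off-topic remark", "never"))

# Named list: each name is a new level, each vector holds the old levels it absorbs.
levels(x) <- list(
  No    = "never",
  Yes   = "Net sales",
  Maybe = "To be discussed"
)

# "off-topic remark" is not listed anywhere, so that observation is now NA.
table(x, useNA = "ifany")
```

The old level strings have to match exactly (including spaces and punctuation) for this to pick them up.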
Upvotes: 0
Reputation: 306
dt$x is surely a factor. To assign different values element by element to a factor column, you should first convert it to character type:
class(dt$x) # should be factor
dt$x <- as.character(dt$x)
class(dt$x) # should be "character"
# a list of collapsed Categories
toCollapseCategories <- list(
  "No"=c("I don't allow anything", "..."),
  "Yes"= c("Number of visitors ,annual sales, sales growth, number of customers", "Net sales", "..."),
  "Maybe"=c("The CEO's approval is needed.", "To be discussed")
)
dt$x[dt$x %in% toCollapseCategories$No] <- "No"
dt$x[dt$x %in% toCollapseCategories$Yes] <- "Yes"
dt$x[dt$x %in% toCollapseCategories$Maybe] <- "Maybe"
# and then get a factor
dt$x <- as.factor(dt$x)
class(dt$x) # factor
Of course, the code can be optimized, but dt$x should be a character vector in order to replace elements this way.
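One possible way to shorten the three replacement lines is a loop over the names of the list. This is only a sketch on toy data; x, to_collapse, recoded and x_new are placeholder names, and the values are invented. It writes into a separate vector, so anything that is not listed ends up as NA.

```r
# Toy data standing in for dt$x (hypothetical values).
x <- factor(c("I don't allow anything", "Net sales", "To be discussed", "Something else"))

# New level name -> old values it should absorb.
to_collapse <- list(
  No    = c("I don't allow anything"),
  Yes   = c("Net sales"),
  Maybe = c("To be discussed")
)

x_chr <- as.character(x)                        # work on plain strings
recoded <- rep(NA_character_, length(x_chr))    # unmatched entries stay NA

for (new_level in names(to_collapse)) {
  recoded[x_chr %in% to_collapse[[new_level]]] <- new_level
}

# Back to a factor with the three levels in a fixed order.
x_new <- factor(recoded, levels = names(to_collapse))
```

Writing into a fresh vector also avoids surprises if one of the new labels happens to equal one of the old answers.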
Upvotes: 0