ilka

Reputation: 59

Collapsing multiple factor levels of (messy) character variable in R

I am struggling to collapse the many levels of one factor variable into just three levels in RStudio.

My starting point is a data.table with 250 variables and roughly 4,000 rows. For one factor variable I want to collapse its 75 levels into 3 levels. Moreover, 4 of the 75 levels should be ignored (or set to NA beforehand) since they contain controversial information. This factor variable is based on survey answers that also include individual free-text answers, sometimes even in a different language. So, it's a bit messy.
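Setting the unwanted levels to NA before collapsing could look roughly like the following. This is a minimal base R sketch on a toy factor standing in for dt$x; the level names and the drop_levels vector are invented for illustration.

```r
# Toy factor standing in for dt$x; the answers are made up.
x <- factor(c("Net sales", "To be discussed", "off-topic rant", "Net sales"))

# Hypothetical set of levels that should be ignored.
drop_levels <- c("off-topic rant")

# Assigning NA to a level removes that level and turns its observations into NA.
levels(x)[levels(x) %in% drop_levels] <- NA

print(levels(x))
print(sum(is.na(x)))
```

After this step the remaining levels can be collapsed with whichever approach is preferred.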

I tried to collapse these 75 levels (or 71 levels if the respective observations are set to NA first) into 3 in two different ways. However, R always shows a + instead of a > in the console and I can't run any further commands. Of course I can stop this by hitting Esc, but that does not get me the desired result.

So, this imaginary example should show what I tried:

1) using the levels and list functions

levels(dt$x) <- list("No"=c("I don't allow anything", "..."), 
"Yes"= c("Number of visitors ,annual sales, sales growth, number of customers", "Net sales", "..."), 
"Maybe"=c("The CEO's approval is needed.", "To be discussed"))

2) using the forcats package

dt$x %>%
fct_collapse(No= c("I don't allow anything", "..."), 
Yes= c("Number of visitors ,annual sales, sales growth", "number of customers", "Net sales", "..."), 
Maybe=c("The CEO's approval is needed.", "To be discussed"))

I assume the problem arises due to how the original variable is structured. Does anyone have an idea how I could address that?

A big thank you upfront!

Best, Ilka

Upvotes: 0

Views: 962

Answers (3)

JWilliman

Reputation: 3883

I've written a function, xfactor, available on GitHub, to help with exactly this kind of situation. It recodes factor levels using regex matching, so it can be useful for working with messy data. It also lets you drop factor levels by regex using the 'exclude' argument.

devtools::install_github("jwilliman/xfactor")
library(xfactor)

dt$x <- xfactor::xfactor(dt$x, levels = c(
  No = "don't|never",
  Yes = "sales|visitors|customers",
  Maybe = "approval|discuss"),
  exclude = "irrelevant", ignore.case = TRUE)

See https://stackoverflow.com/a/37800944/4241780 for further examples.
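For readers who prefer not to install a package, the general idea of regex-based recoding can be sketched in base R. This is not the xfactor implementation; recode_by_regex is a hypothetical helper, and the "first matching pattern wins" rule is an assumption made for this sketch.

```r
# Hypothetical helper: map each answer to the first pattern that matches it.
recode_by_regex <- function(x, patterns, exclude = NULL) {
  x <- as.character(x)
  out <- rep(NA_character_, length(x))
  for (new_level in names(patterns)) {
    # Only fill entries that no earlier pattern has claimed.
    hit <- is.na(out) & grepl(patterns[[new_level]], x, ignore.case = TRUE)
    out[hit] <- new_level
  }
  if (!is.null(exclude)) {
    # Answers matching the exclude pattern are set to NA.
    out[grepl(exclude, x, ignore.case = TRUE)] <- NA_character_
  }
  factor(out, levels = names(patterns))
}

# Invented example answers.
answers <- c("I don't allow anything", "Net sales", "To be discussed",
             "irrelevant comment about sales")
patterns <- c(No = "don't|never",
              Yes = "sales|visitors|customers",
              Maybe = "approval|discuss")

result <- recode_by_regex(answers, patterns, exclude = "irrelevant")
print(result)
```

Answers that match no pattern stay NA, which makes leftover messy responses easy to spot with is.na().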

By the way, if R is returning a + instead of a > in the console, you have probably missed a closing ) or " somewhere!
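One way to check a long command before pasting it into the console is to run it through parse(), which raises an error when a bracket or quote is left open. A small sketch follows; is_parseable is a hypothetical helper name, and it reports any syntax error, not only unfinished input.

```r
# Hypothetical helper: TRUE if the code text parses, FALSE otherwise.
is_parseable <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

complete_call  <- 'c("a", "b")'
missing_paren  <- 'c("a", "b"'
missing_quote  <- 'c("a", "b)'

print(is_parseable(complete_call))
print(is_parseable(missing_paren))
print(is_parseable(missing_quote))
```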

Upvotes: 0

ilka

Reputation: 59

A friend of mine actually provided the answer. It has nothing to do with the data structure.

This does the job:

library(forcats)  # provides fct_collapse()

dt$x <- fct_collapse(dt$x,
  No = c(
    "I don't allow anything",
    "..."),
  Yes = c(
    "Number of visitors ,annual sales, sales growth",
    "number of customers",
    "Net sales",
    "..."),
  Maybe = c(
    "The CEO's approval is needed.",
    "To be discussed")
)

I still don't know why the first option I posted above doesn't work though (it did perfectly well with another variable).
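For reference, the list form of levels<- does run on a small toy factor, so the approach itself seems sound. A minimal sketch with invented data follows; note that any level not mentioned in the list becomes NA.

```r
# Toy factor standing in for dt$x; the answers are made up.
x <- factor(c("I don't allow anything", "Net sales", "To be discussed",
              "Net sales", "something else"))

# Each list name is a new level; its value lists the old levels it absorbs.
levels(x) <- list(
  No    = "I don't allow anything",
  Yes   = "Net sales",
  Maybe = "To be discussed"
)

# "something else" was not listed, so it is now NA.
print(table(x, useNA = "ifany"))
```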

Upvotes: 0

dt$x is presumably a factor. Assigning a value that is not one of its existing levels to a factor produces NA with a warning, so to assign new values you should first convert the column to character.

class(dt$x) # should be factor
dt$x <- as.character(dt$x)
class(dt$x) # should be "character"  

# a list of collapsed Categories
toCollapseCategories <- list(
    "No"=c("I don't allow anything", "..."), 
    "Yes"= c("Number of visitors ,annual sales, sales growth, number of customers",
             "Net sales", "..."), 
    "Maybe"=c("The CEO's approval is needed.", "To be discussed")
)

dt$x[dt$x %in% toCollapseCategories$No] <- "No"
dt$x[dt$x %in% toCollapseCategories$Yes] <- "Yes"
dt$x[dt$x %in% toCollapseCategories$Maybe] <- "Maybe"

# and then get a factor
dt$x <- as.factor(dt$x)
class(dt$x) # factor

Of course, the code can be optimized, but dt$x has to be a character vector for the element-wise replacement to work.
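One possible tidy-up, sketched on a toy character vector with invented answers, is to loop over the names of the list instead of writing one replacement line per category. Passing explicit levels to factor() at the end is an addition of this sketch: it turns anything that was not recoded into NA.

```r
# Toy character vector standing in for as.character(dt$x).
x <- c("I don't allow anything", "Net sales", "To be discussed",
       "The CEO's approval is needed.", "unrelated remark")

# Named list: new category -> original answers it should absorb.
groups <- list(
  No    = c("I don't allow anything"),
  Yes   = c("Net sales"),
  Maybe = c("The CEO's approval is needed.", "To be discussed")
)

# Replace every original answer with the name of its category.
for (new_level in names(groups)) {
  x[x %in% groups[[new_level]]] <- new_level
}

# Values outside the three categories become NA here.
x <- factor(x, levels = names(groups))
print(x)
```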

Upvotes: 0
