Btibert3
Btibert3

Reputation: 40186

Case Statement Equivalent in R

I have a variable in a dataframe where one of the fields typically has 7-8 values. I want to collpase them 3 or 4 new categories within a new variable within the dataframe. What is the best approach?

I would use a CASE statement if I were in a SQL-like tool but not sure how to attack this in R.

Any help you can provide will be much appreciated!

Upvotes: 107

Views: 221669

Answers (17)

JohnR
JohnR

Reputation: 1

It is often easier to use a (small) reference table to map the classifications and makes it easier for (non R) people to follow. You can even set up a table to map concatenated variables. You can set up multiple columns as well, obviously.

Upvotes: 0

Deepal
Deepal

Reputation: 21

com = '102'
switch (com,
    '110' = (com= '23279'),
    '101' = (com='23276'),
    '102'= (com = '23277'),
    '111' = (com = '23281'),
    '112' = (com = '23283')
)

print(com)

Upvotes: 2

Evan Cortens
Evan Cortens

Reputation: 790

case_when(), which was added to dplyr in May 2016, solves this problem in a manner similar to memisc::cases().

As of dplyr 0.7.0, for example:

mtcars %>% 
  mutate(category = case_when(
    cyl == 4 & disp < median(disp) ~ "4 cylinders, small displacement",
    cyl == 8 & disp > median(disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)

Original answer

library(dplyr)
mtcars %>% 
  mutate(category = case_when(
    .$cyl == 4 & .$disp < median(.$disp) ~ "4 cylinders, small displacement",
    .$cyl == 8 & .$disp > median(.$disp) ~ "8 cylinders, large displacement",
    TRUE ~ "other"
  )
)

Upvotes: 58

andschar
andschar

Reputation: 3993

As of data.table v1.13.0 you can use the function fcase() (fast-case) to do SQL-like CASE operations (also similar to dplyr::case_when()):

require(data.table)

dt <- data.table(name = c('cow','pig','eagle','pigeon','cow','eagle'))
dt[ , category := fcase(name %in% c('cow', 'pig'), 'mammal',
                        name %in% c('eagle', 'pigeon'), 'bird') ]

Upvotes: 4

adamsss6
adamsss6

Reputation: 319

I see no proposal for 'switch'. Code example (run it):

x <- "three"
y <- 0
switch(x,
       one = {y <- 5},
       two = {y <- 12},
       three = {y <- 432})
y

Upvotes: 31

patrickmdnet
patrickmdnet

Reputation: 3392

You can use the base function merge for case-style remapping tasks:

df <- data.frame(name = c('cow','pig','eagle','pigeon','cow','eagle'), 
                 stringsAsFactors = FALSE)

mapping <- data.frame(
  name=c('cow','pig','eagle','pigeon'),
  category=c('mammal','mammal','bird','bird')
)

merge(df,mapping)
# name category
# 1    cow   mammal
# 2    cow   mammal
# 3  eagle     bird
# 4  eagle     bird
# 5    pig   mammal
# 6 pigeon     bird

Upvotes: 2

petzi
petzi

Reputation: 1636

I am using in those cases you are referring switch(). It looks like a control statement but actually, it is a function. The expression is evaluated and based on this value, the corresponding item in the list is returned.

switch works in two distinct ways depending whether the first argument evaluates to a character string or a number.

What follows is a simple string example which solves your problem to collapse old categories to new ones.

For the character-string form, have a single unnamed argument as the default after the named values.

newCat <- switch(EXPR = category,
       cat1   = catX,
       cat2   = catX,
       cat3   = catY,
       cat4   = catY,
       cat5   = catZ,
       cat6   = catZ,
       "not available")

Upvotes: 8

Henrico
Henrico

Reputation: 2579

Have a look at the cases function from the memisc package. It implements case-functionality with two different ways to use it. From the examples in the package:

z1=cases(
    "Condition 1"=x<0,
    "Condition 2"=y<0,# only applies if x >= 0
    "Condition 3"=TRUE
    )

where x and y are two vectors.

References: memisc package, cases example

Upvotes: 31

Mixing plyr::mutate and dplyr::case_when works for me and is readable.

iris %>%
plyr::mutate(coolness =
     dplyr::case_when(Species  == "setosa"     ~ "not cool",
                      Species  == "versicolor" ~ "not cool",
                      Species  == "virginica"  ~ "super awesome",
                      TRUE                     ~ "undetermined"
       )) -> testIris
head(testIris)
levels(testIris$coolness)  ## NULL
testIris$coolness <- as.factor(testIris$coolness)
levels(testIris$coolness)  ## ok now
testIris[97:103,4:6]

Bonus points if the column can come out of mutate as a factor instead of char! The last line of the case_when statement, which catches all un-matched rows is very important.

     Petal.Width    Species      coolness
 97         1.3  versicolor      not cool
 98         1.3  versicolor      not cool  
 99         1.1  versicolor      not cool
100         1.3  versicolor      not cool
101         2.5  virginica     super awesome
102         1.9  virginica     super awesome
103         2.1  virginica     super awesome

Upvotes: 2

IRTFM
IRTFM

Reputation: 263481

There is a switch statement but I can never seem to get it to work the way I think it should. Since you have not provided an example I will make one using a factor variable:

 dft <-data.frame(x = sample(letters[1:8], 20, replace=TRUE))
 levels(dft$x)
[1] "a" "b" "c" "d" "e" "f" "g" "h"

If you specify the categories you want in an order appropriate to the reassignment you can use the factor or numeric variables as an index:

c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x]
 [1] "def" "h"   "g"   "def" "def" "abc" "h"   "h"   "def" "abc" "abc" "abc" "h"   "h"   "abc"
[16] "def" "abc" "abc" "def" "def"

dft$y <- c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x] str(dft)
'data.frame':   20 obs. of  2 variables:
 $ x: Factor w/ 8 levels "a","b","c","d",..: 4 8 7 4 6 1 8 8 5 2 ...
 $ y: chr  "def" "h" "g" "def" ...

I later learned that there really are two different switch functions. It's not generic function but you should think about it as either switch.numeric or switch.character. If your first argument is an R 'factor', you get switch.numeric behavior, which is likely to cause problems, since most people see factors displayed as character and make the incorrect assumption that all functions will process them as such.

Upvotes: 10

kuba
kuba

Reputation: 1035

If you want to have sql-like syntax you can just make use of sqldf package. Tthe function to be used is also names sqldf and the syntax is as follows

sqldf(<your query in quotation marks>)

Upvotes: 3

Marek
Marek

Reputation: 50783

If you got factor then you could change levels by standard method:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
             stringsAsFactors = FALSE)
df$type <- factor(df$name) # First step: copy vector and make it factor
# Change levels:
levels(df$type) <- list(
    animal = c("cow", "pig"),
    bird = c("eagle", "pigeon")
)
df
#     name   type
# 1    cow animal
# 2    pig animal
# 3  eagle   bird
# 4 pigeon   bird

You could write simple function as a wrapper:

changelevels <- function(f, ...) {
    f <- as.factor(f)
    levels(f) <- list(...)
    f
}

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = TRUE)

df$type <- changelevels(df$name, animal=c("cow", "pig"), bird=c("eagle", "pigeon"))

Upvotes: 26

Aaron - mostly inactive
Aaron - mostly inactive

Reputation: 37824

A case statement actually might not be the right approach here. If this is a factor, which is likely is, just set the levels of the factor appropriately.

Say you have a factor with the letters A to E, like this.

> a <- factor(rep(LETTERS[1:5],2))
> a
 [1] A B C D E A B C D E
Levels: A B C D E

To join levels B and C and name it BC, just change the names of those levels to BC.

> levels(a) <- c("A","BC","BC","D","E")
> a
 [1] A  BC BC D  E  A  BC BC D  E 
Levels: A BC D E

The result is as desired.

Upvotes: 2

jamesM
jamesM

Reputation: 59

i dont like any of these, they are not clear to the reader or the potential user. I just use an anonymous function, the syntax is not as slick as a case statement, but the evaluation is similar to a case statement and not that painful. this also assumes your evaluating it within where your variables are defined.

result <- ( function() { if (x==10 | y< 5) return('foo') 
                         if (x==11 & y== 5) return('bar')
                        })()

all of those () are necessary to enclose and evaluate the anonymous function.

Upvotes: 5

Prasad Chalasani
Prasad Chalasani

Reputation: 20282

Here's a way using the switch statement:

df <- data.frame(name = c('cow','pig','eagle','pigeon'), 
                 stringsAsFactors = FALSE)
df$type <- sapply(df$name, switch, 
                  cow = 'animal', 
                  pig = 'animal', 
                  eagle = 'bird', 
                  pigeon = 'bird')

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird

The one downside of this is that you have to keep writing the category name (animal, etc) for each item. It is syntactically more convenient to be able to define our categories as below (see the very similar question How do add a column in a data frame in R )

myMap <- list(animal = c('cow', 'pig'), bird = c('eagle', 'pigeon'))

and we want to somehow "invert" this mapping. I write my own invMap function:

invMap <- function(map) {
  items <- as.character( unlist(map) )
  nams <- unlist(Map(rep, names(map), sapply(map, length)))
  names(nams) <- items
  nams
}

and then invert the above map as follows:

> invMap(myMap)
     cow      pig    eagle   pigeon 
"animal" "animal"   "bird"   "bird" 

And then it's easy to use this to add the type column in the data-frame:

df <- transform(df, type = invMap(myMap)[name])

> df
    name   type
1    cow animal
2    pig animal
3  eagle   bird
4 pigeon   bird

Upvotes: 33

Gregory Demin
Gregory Demin

Reputation: 4846

Imho, most straightforward and universal code:

dft=data.frame(x = sample(letters[1:8], 20, replace=TRUE))
dft=within(dft,{
    y=NA
    y[x %in% c('a','b','c')]='abc'
    y[x %in% c('d','e','f')]='def'
    y[x %in% 'g']='g'
    y[x %in% 'h']='h'
})

Upvotes: 17

Ian Fellows
Ian Fellows

Reputation: 17348

You can use recode from the car package:

library(ggplot2) #get data
library(car)
daimons$new_var <- recode(diamonds$clarity , "'I1' = 'low';'SI2' = 'low';else = 'high';")[1:10]

Upvotes: 6

Related Questions