Tom

Reputation: 2341

Factor levels by group

I have a data.table that looks as follows:

library(data.table)
dt <- fread(
    "Sex   Height   
     M   180   
     F   179
     F   162   
     M   181  
     M   165   
     M   178   
     F   172   
     F   160",
  header = TRUE
)

I would like to split the heights into groups, with separate groups for men and women. The following code gives me three factor levels, where I would like six.

dt[,height_f := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE), by="Sex"]

> table(dt$height_f)

  [0,165) [165,180) [180,300) 
        2         4         2

I have the feeling that it should be something very simple, but I cannot figure out how to write it.

Desired output:

> table(dt$height_f)

  M:[0,165) M:[165,180) M:[180,300)   F:[0,165) F:[165,180) F:[180,300) 
          0           3           1           2           2           0 

Upvotes: 3

Views: 486

Answers (2)

s_baldur

Reputation: 33498

A data.table solution:

dt[, height_cat := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)]
dt[, height_f := 
       factor(
         paste(Sex, height_cat, sep = ":"), 
         levels = dt[, CJ(Sex, height_cat, unique = TRUE)][, paste(Sex, height_cat, sep = ":")]
       )]

table(dt$height_f)
# F:[0,165) F:[165,180) F:[180,300)   M:[0,165) M:[165,180) M:[180,300) 
#         2           2           0           0           2           2 
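
As a possible alternative, base R's interaction() can cross the two factors in one step and keeps every combination as a level by default. This is only a sketch under the same breaks as above; the data is rebuilt with data.table() instead of fread() so the snippet stands on its own, and the name counts is just for illustration.

```r
library(data.table)

# Same data as in the question, built directly rather than parsed from text
dt <- data.table(
  Sex    = c("M", "F", "F", "M", "M", "M", "F", "F"),
  Height = c(180, 179, 162, 181, 165, 178, 172, 160)
)

# Bin once on the whole column so both sexes share the same three intervals
dt[, height_cat := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)]

# interaction() crosses Sex with the bins; unused combinations stay as levels.
# lex.order = TRUE makes Sex vary slowest, so all F levels come before M.
dt[, height_f := interaction(Sex, height_cat, sep = ":", lex.order = TRUE)]

counts <- table(dt$height_f)
counts
```

The level order should match the factor() version above (F first, then M), with the empty combinations reported as zero counts.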

Upvotes: 2

heds1

Reputation: 3438

This might be appropriate. We don't end up using table to show the output, although I think the tibble output is probably more useful anyway:

library(dplyr)

dt %>%
    mutate(Height = cut(Height, breaks = c(0, 166, 181, 301))) %>%
    group_by(Sex, Height, .drop = FALSE) %>%
    summarise(n = n())

## A tibble: 6 x 3
## Groups:   Sex [2]
#  Sex   Height        n
#  <chr> <fct>     <int>
#1 F     (0,166]       2
#2 F     (166,181]     2
#3 F     (181,301]     0
#4 M     (0,166]       1
#5 M     (166,181]     3
#6 M     (181,301]     0

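Note that cut() defaults to right = TRUE, so each break is the inclusive upper bound of its interval ("up to and including this number"). That is why I shifted each break up by 1 (that is, breaks = c(0, 166, 181, 301)); if you want the left-closed intervals from your question, such as [0,165), pass the original breaks with right = FALSE instead. We also need to specify .drop = FALSE in group_by() if we want the empty groups to show up as in your desired output (it defaults to TRUE).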
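
To make the effect of the right argument concrete, here is a small base R sketch using the heights from the question. The vector names (heights, sex, closed_right, closed_left, counts) are only illustrative. It shows where the boundary value 165 lands under each setting, and that a two-way table() keeps the empty combinations without extra work.

```r
# Heights and sexes from the question, as plain vectors
heights <- c(180, 179, 162, 181, 165, 178, 172, 160)
sex     <- c("M", "F", "F", "M", "M", "M", "F", "F")

# Default right = TRUE: intervals look like (0,165], so 165 falls in the first bin
closed_right <- cut(heights, breaks = c(0, 165, 180, 300))

# right = FALSE: intervals look like [165,180), so 165 falls in the second bin
closed_left <- cut(heights, breaks = c(0, 165, 180, 300), right = FALSE)

# The fifth height is the boundary value 165
as.character(closed_right[5])
as.character(closed_left[5])

# Cross-tabulate sex against the left-closed bins; empty cells are kept as zeros
counts <- table(sex, closed_left)
counts
```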

Upvotes: 0
