Reputation: 2719
Example: I have a dataset of heights by gender. I'd like to split the heights into low and high, where the cut point is defined as the mean minus 2 SD within each gender.
Example dataset:
set.seed(8)
df = data.frame(sex = c(rep("M", 100), rep("F", 100)),
                ht = c(rnorm(100, mean = 1.7, sd = .17), rnorm(100, mean = 1.6, sd = .16)))
I'd like to do this in a single line of vectorized code. I'm fairly sure that is possible, but I do not know how to write it. I imagine that there may be a way to use cut(), apply(), and/or dplyr to achieve this.
Upvotes: 0
Views: 495
Reputation: 2719
Just discovered the following solution using base R:
df$ht_grp <- ave(x = df$ht, df$sex,
                 FUN = function(x)
                   cut(x, breaks = c(0, mean(x, na.rm = TRUE) - 2 * sd(x, na.rm = TRUE), Inf)))
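This works because I know that 0 and Inf are reasonable bounds, but I could also use min(x) and max(x) as my lower and upper bounds (with include.lowest = TRUE, otherwise the minimum itself falls outside the first interval and becomes NA). Note that ave() writes the result of cut() back into a numeric vector, so ht_grp ends up holding the factor's integer codes (1 = low, 2 = high) rather than a labelled factor.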
My prior solution: I came up with the following two-step process which is not so bad:
df = merge(df,
           setNames(aggregate(ht ~ sex, df, FUN = function(x) mean(x) - 2 * sd(x)),
                    c("sex", "ht_cutoff")),
           by = "sex")
df$ht_is_low = ifelse(df$ht <= df$ht_cutoff, 1, 0)
Upvotes: 0
Reputation: 1644
In the code below, I created two new data frames. Both are built by grouping on the sex variable and then filtering on ht: df_low keeps the rows more than 2 SD below the group mean, and df_high keeps the rows more than 2 SD above it.
library(dplyr)
df_low  <- df %>% group_by(sex) %>% filter(ht < (mean(ht) - 2 * sd(ht)))
df_high <- df %>% group_by(sex) %>% filter(ht > (mean(ht) + 2 * sd(ht)))
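If a single data frame with a group column is preferred over two filtered ones, the same grouped comparison can go inside mutate(). This is only a sketch: the column name ht_grp and the object name df_grp are assumptions for illustration, and "high" here simply means "above the mean minus 2 SD", as in the question.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# Inside a grouped mutate(), mean() and sd() are evaluated per gender
df_grp <- df %>%
  group_by(sex) %>%
  mutate(ht_grp = if_else(ht <= mean(ht) - 2 * sd(ht), "low", "high")) %>%
  ungroup()

# Count rows per gender and group
count(df_grp, sex, ht_grp)
```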
Upvotes: 0
Reputation: 12937
How about this using cut from base R:
sapply(c("F", "M"), function(s) {
  dfF <- df[df$sex == s, ]  # keep only the rows for gender s
  cut(dfF$ht, breaks = c(0, mean(dfF$ht) - 2 * sd(dfF$ht), Inf), labels = c("low", "high"))
})
# dfF$ht: heights for one gender
# mean(dfF$ht) - 2 * sd(dfF$ht): cut point for that gender
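To store the result as a column of df in the original row order, one option is base R's split() and unsplit() around the same cut() call. This is a rough sketch rather than the approach above; the helper name label_ht and the column name ht_grp are made up for the example.

```r
# Same simulated data as in the question
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# Hypothetical helper: label one gender's heights relative to its own cutoff
label_ht <- function(x) {
  cut(x, breaks = c(0, mean(x) - 2 * sd(x), Inf), labels = c("low", "high"))
}

# split() groups the heights by gender, lapply() labels each group,
# and unsplit() puts the labels back in the original row order
df$ht_grp <- unsplit(lapply(split(df$ht, df$sex), label_ht), df$sex)

table(df$sex, df$ht_grp)
```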
Upvotes: 1