iomedee
iomedee

Reputation: 393

Grouping data by custom range in R (for example, 0-4, 1-5, 2-6, 3-7 and etc.)

I'm using Titanic dataset from https://www.kaggle.com/c/titanic.

I want to make different age groups like "0-4", "1-5", "2-6", "3-7" and etc. and find out the age group where the percentage of survived is the highest. My age groups are within the integer interval [0,80]. The column "Age" in the original dataset also includes NAs. The column "Survived" contains the information if the person was survived or not (0 = No, 1 = Yes).

I was trying to solve this problem like this but it didn't work. I will be grateful for any help.

for(i in 0:80){
max= -Inf
x[i]<-(sum(subset(dataset, Age < (i+5) & Age >= i, select = "Survived")))/(length(which(dataset$Age < (i+5) & dataset$Age>= i)))
if (x[i] > max) max <- x[i]
return(max, i, i+5)}

Upvotes: 0

Views: 709

Answers (3)

Uwe
Uwe

Reputation: 42564

The OP has requested to

find out the age group where the percentage of survived is the highest

There is one possible solution which uses a non-equi join:

library(data.table)
library(titanic)
dataset <- as.data.table(titanic_train)
delta <- 4
max_age <- 80
dataset[
  .(lower = seq(0, max_age - delta), upper = seq(delta, max_age)), 
  on = .(Age >= lower, Age <= upper), .SD[, .N, by = Survived], by = .EACHI][
    , total.N := sum(N), by = Age][, share := N / total.N][]
     Age Age Survived  N total.N     share
  1:   0   4        1 27      40 0.6750000
  2:   0   4        0 13      40 0.3250000
  3:   1   5        0 13      37 0.3513514
  4:   1   5        1 24      37 0.6486486
  5:   2   6        0 12      33 0.3636364
 ---                                      
137:  72  76        0  1       1 1.0000000
138:  73  77        0  1       1 1.0000000
139:  74  78        0  1       1 1.0000000
140:  75  79       NA  1       1 1.0000000
141:  76  80        1  1       1 1.0000000

The result so far shows that age groups with only one passenger who has survived have the highest survival rate of 100%. This is trivial and probably not the result the OP is looking for. Either delta has to be enlarged or a threshold on the absolute number of passengers per age group has to be applied, e.g., total.N > 2.

Upvotes: 2

David Klotz
David Klotz

Reputation: 2431

One approach is to tabulate everything by age, then turn those into groups. Then use the lag() function recursively to add each set of five rows. Assuming the dataframe (df) has two columns, Age and Survived (0 or 1):

library(dplyr)

df %>% group_by(Age) %>% 
  summarize(n = n(), Survived = sum(Survived)) %>% 
  arrange(Age) %>% 
  mutate(Age_Group = paste0(Age-4, "-", Age)) %>% 
  mutate_at(vars(n, Survived), 
  funs(. + lag(.) + lag(lag(.)) + lag(lag(lag(.))) + lag(lag(lag(lag(.)))))) %>% 
  mutate(SurvivalRate = Survived/n) %>% 
  filter(!is.na(n)) %>% 
  select(-Age)

Upvotes: 0

Rui Barradas
Rui Barradas

Reputation: 76495

First I will make up some data.

set.seed(1234)   # make it reproducible
Age <- sample(c(NA, 1:80), 200, TRUE)
Survived <- sample(0:1, 200, TRUE)

Now the code. Define a function to cut the input into groups, and lapply that function to each group start, from 0 to 3.

fun <- function(i, x){
    as.character(cut(x, breaks = seq(i, 80, by = 5), include.lowest = TRUE))
}

res <- unlist(lapply(0:3, fun, Age))
table(res)

res_surv <- unlist(lapply(0:3, fun, Age[Survived == 1]))
table(res_surv)

If you want percentages, you can do it with

tbl_surv <- table(res_surv)
100*tbl_surv/sum(tbl_surv)

Upvotes: 2

Related Questions