Reputation: 393
I'm using Titanic dataset from https://www.kaggle.com/c/titanic.
I want to make different age groups like "0-4", "1-5", "2-6", "3-7" and etc. and find out the age group where the percentage of survived is the highest. My age groups are within the integer interval [0,80]. The column "Age" in the original dataset also includes NAs. The column "Survived" contains the information if the person was survived or not (0 = No, 1 = Yes).
I was trying to solve this problem like this but it didn't work. I will be grateful for any help.
for(i in 0:80){
max= -Inf
x[i]<-(sum(subset(dataset, Age < (i+5) & Age >= i, select = "Survived")))/(length(which(dataset$Age < (i+5) & dataset$Age>= i)))
if (x[i] > max) max <- x[i]
return(max, i, i+5)}
Upvotes: 0
Views: 709
Reputation: 42564
The OP has requested to
find out the age group where the percentage of survived is the highest
There is one possible solution which uses a non-equi join:
library(data.table)
library(titanic)
dataset <- as.data.table(titanic_train)
delta <- 4
max_age <- 80
dataset[
.(lower = seq(0, max_age - delta), upper = seq(delta, max_age)),
on = .(Age >= lower, Age <= upper), .SD[, .N, by = Survived], by = .EACHI][
, total.N := sum(N), by = Age][, share := N / total.N][]
Age Age Survived N total.N share 1: 0 4 1 27 40 0.6750000 2: 0 4 0 13 40 0.3250000 3: 1 5 0 13 37 0.3513514 4: 1 5 1 24 37 0.6486486 5: 2 6 0 12 33 0.3636364 --- 137: 72 76 0 1 1 1.0000000 138: 73 77 0 1 1 1.0000000 139: 74 78 0 1 1 1.0000000 140: 75 79 NA 1 1 1.0000000 141: 76 80 1 1 1 1.0000000
The result so far shows that age groups with only one passenger who has survived have the highest survival rate of 100%. This is trivial and probably not the result the OP is looking for. Either delta
has to be enlarged or a threshold on the absolute number of passengers per age group has to be applied, e.g., total.N > 2
.
Upvotes: 2
Reputation: 2431
One approach is to tabulate everything by age, then turn those into groups. Then use the lag()
function recursively to add each set of five rows. Assuming the dataframe (df
) has two columns, Age
and Survived
(0 or 1):
library(dplyr)
df %>% group_by(Age) %>%
summarize(n = n(), Survived = sum(Survived)) %>%
arrange(Age) %>%
mutate(Age_Group = paste0(Age-4, "-", Age)) %>%
mutate_at(vars(n, Survived),
funs(. + lag(.) + lag(lag(.)) + lag(lag(lag(.))) + lag(lag(lag(lag(.)))))) %>%
mutate(SurvivalRate = Survived/n) %>%
filter(!is.na(n)) %>%
select(-Age)
Upvotes: 0
Reputation: 76495
First I will make up some data.
set.seed(1234) # make it reproducible
Age <- sample(c(NA, 1:80), 200, TRUE)
Survived <- sample(0:1, 200, TRUE)
Now the code. Define a function to cut
the input into groups, and lapply
that function to each group start, from 0
to 3
.
fun <- function(i, x){
as.character(cut(x, breaks = seq(i, 80, by = 5), include.lowest = TRUE))
}
res <- unlist(lapply(0:3, fun, Age))
table(res)
res_surv <- unlist(lapply(0:3, fun, Age[Survived == 1]))
table(res_surv)
If you want percentages, you can do it with
tbl_surv <- table(res_surv)
100*tbl_surv/sum(tbl_surv)
Upvotes: 2